arxiv: 2606.06473 · v1 · pith:MFIGT5G3new · submitted 2026-06-04 · 💻 cs.AI · cs.CL

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Pith reviewed 2026-06-28 00:54 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords automated machine learning · LLM agents · algorithm discovery · self-evolving frameworks · multi-agent systems · MLE-Bench · tree search · memory retrieval

The pith

MLEvolve lets LLM agents discover machine learning algorithms by sharing information across search branches and reusing past experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current LLM agents for machine learning engineering tasks suffer from isolated search branches that cannot share findings, searches that forget prior attempts, and weak separation between high-level planning and low-level code writing. MLEvolve introduces three targeted fixes: Progressive MCGS extends ordinary tree search with graph edges that link branches and gradually narrows focus using an entropy schedule; Retrospective Memory stores both fixed domain knowledge and dynamic task experience for later retrieval; adaptive coding modes keep strategic decisions separate from code generation. These changes produce higher medal rates and valid submission rates on MLE-Bench even when the time budget is cut in half, and they also beat specialized methods on mathematical algorithm tasks. A reader would care because the work shows one concrete route toward agents that can keep improving on long engineering problems without repeated human resets.

Core claim

MLEvolve is an LLM-based self-evolving multi-agent framework built on three mechanisms: Progressive MCGS, which extends tree search with cross-branch information flow; Retrospective Memory, which retrieves and reuses experience; and adaptive coding modes, which decouple strategic planning from code generation. The paper claims state-of-the-art average medal rate and valid submission rate on MLE-Bench under a 12-hour budget, and better results than AlphaEvolve on mathematical algorithm optimization.

What carries the argument

Progressive MCGS, which augments tree search with graph-based reference edges for cross-branch flow and applies an entropy-inspired progressive schedule to move from broad exploration to focused exploitation.
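To make the two ideas in that sentence concrete, here is a minimal sketch, not the paper's implementation. It assumes a hypothetical `Node` type with a `references` list standing in for the graph-based reference edges, and a linear temperature decay standing in for the entropy-inspired schedule; the abstract does not specify either form, so every name and constant below is illustrative.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate solution in the search graph (hypothetical structure)."""
    name: str
    score: float
    parent: "Node | None" = None
    # Cross-branch reference edges: nodes from other branches whose
    # findings this node may consult. A plain tree leaves this empty.
    references: list["Node"] = field(default_factory=list)


def temperature(step: int, total_steps: int,
                t_start: float = 2.0, t_end: float = 0.1) -> float:
    """Assumed linear schedule: high temperature early, low temperature late."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def selection_probs(nodes: list[Node], temp: float) -> list[float]:
    """Softmax over scores; a lower temperature concentrates mass on the best node."""
    best = max(n.score for n in nodes)
    weights = [math.exp((n.score - best) / temp) for n in nodes]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of the selection distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def context_for(node: Node) -> list[str]:
    """Names of nodes whose results are visible when expanding `node`."""
    visible = []
    cur = node.parent
    while cur is not None:  # ordinary tree ancestry
        visible.append(cur.name)
        cur = cur.parent
    return visible + [r.name for r in node.references]  # plus cross-branch edges


# Usage: two sibling branches, with one reference edge from b to a.
root = Node("root", 0.50)
a = Node("branch_a", 0.62, parent=root)
b = Node("branch_b", 0.58, parent=root)
b.references.append(a)

frontier = [root, a, b]
early = entropy(selection_probs(frontier, temperature(0, 10)))
late = entropy(selection_probs(frontier, temperature(9, 10)))
print(context_for(b))  # ['root', 'branch_a']
print(early > late)    # True: selection entropy falls as the schedule progresses
```

In this toy version the reference edge only widens what a node can see, and the schedule only changes how sharply selection favors high scores. Whether MLEvolve couples the two more tightly is not stated in the abstract.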

If this is right

  • Higher average medal rate and valid submission rate on MLE-Bench when restricted to a 12-hour budget.
  • Better performance than specialized algorithm discovery methods such as AlphaEvolve on mathematical optimization tasks.
  • Demonstrated cross-domain generalization from machine learning engineering to mathematical algorithm discovery.
  • Sustained self-evolution over long-horizon tasks through accumulated experience reuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cross-branch and memory mechanisms could be tested on other long-horizon LLM tasks such as automated scientific experiment design.
  • If Retrospective Memory continues to scale without degradation, the framework may support multi-week iterative discovery runs without external resets.
  • Disabling the progressive entropy schedule while keeping the graph edges would isolate whether the exploration-to-exploitation shift is necessary for the reported gains.

Load-bearing premise

The measured performance gains come from the cross-branch edges, retrospective memory retrieval, and adaptive coding modes rather than from other aspects of the implementation.

What would settle it

An ablation that removes the cross-branch reference edges or the dynamic memory component and measures no reduction in medal rate or valid submission rate on MLE-Bench.
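The shape of such an ablation can be sketched as follows. This is an illustration under stated assumptions, not a protocol from the paper: the component flag names, the `ablation_configs` helper, and the `shows_no_reduction` check with its `tolerance` parameter are all hypothetical.

```python
# Hypothetical flag names for the three mechanisms named in the paper.
COMPONENTS = ("cross_branch_edges", "dynamic_memory", "adaptive_coding_modes")


def ablation_configs(components=COMPONENTS) -> dict:
    """Return the full system plus one config per component with only that one off."""
    full = {name: True for name in components}
    configs = {"full": full}
    for name in components:
        cfg = dict(full)   # copy so every ablation starts from the full system
        cfg[name] = False  # disable exactly one component
        configs[f"no_{name}"] = cfg
    return configs


def shows_no_reduction(full_rate: float, ablated_rate: float,
                       tolerance: float = 0.0) -> bool:
    """True if removing a component did not lower the metric beyond `tolerance`."""
    return ablated_rate >= full_rate - tolerance


configs = ablation_configs()
for label, cfg in configs.items():
    print(label, cfg)
```

Each configuration would be run with the same LLM backbone and budget. If `shows_no_reduction` held for the run without cross-branch edges or without dynamic memory, that component would not be carrying the reported gains.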

Original abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.
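The abstract describes Retrospective Memory only at a high level: a fixed cold-start knowledge base plus a dynamic store that grows during the run, both queried for relevant experience. A minimal sketch of that two-store layout follows. The class name, the bag-of-words similarity, and the example notes are assumptions made for illustration; the paper's actual retrieval method is not given in the abstract.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for whatever encoder a real system uses."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class RetrospectiveMemorySketch:
    """Hypothetical two-store memory: fixed domain knowledge plus run-time experience."""

    def __init__(self, domain_knowledge: list[str]):
        self.static = list(domain_knowledge)  # cold-start knowledge base, never modified
        self.dynamic: list[str] = []          # experience recorded during the search

    def record(self, note: str) -> None:
        """Store a new piece of task experience."""
        self.dynamic.append(note)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k entries most similar to the query, drawn from both stores."""
        q = embed(query)
        entries = self.static + self.dynamic
        ranked = sorted(entries, key=lambda e: cosine(q, embed(e)), reverse=True)
        return ranked[:k]


memory = RetrospectiveMemorySketch([
    "use stratified folds for imbalanced classification",
    "normalize pixel values before training image models",
])
memory.record("image models overfit after ten epochs without augmentation")
hits = memory.retrieve("training image models", k=2)
print(hits)
```

The query pulls one entry from the static store and one from the dynamic store, which is the behavior the abstract attributes to combining the two sources.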

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MLEvolve, an LLM-based multi-agent framework for end-to-end machine learning algorithm discovery. It extends tree search via Progressive MCGS to enable cross-branch information flow through graph edges and an entropy-inspired progressive schedule, introduces Retrospective Memory combining a cold-start knowledge base with dynamic global memory for experience retrieval, and decouples strategic planning from code generation using adaptive coding modes. The central empirical claim is that the full system achieves state-of-the-art average medal rate and valid submission rate on MLE-Bench under a 12-hour budget (half the standard runtime) and outperforms specialized methods including AlphaEvolve on mathematical algorithm optimization tasks.

Significance. If the performance claims hold and the gains can be attributed to the three proposed mechanisms, the work would constitute a meaningful step forward in self-evolving LLM agents for long-horizon MLE tasks by addressing inter-branch isolation and memoryless search. The public release of code at the cited GitHub repository is a clear strength that supports reproducibility and follow-on work.

major comments (1)
  1. [Evaluation] Evaluation section: the central claim attributes the reported SOTA medal rate, valid submission rate, and outperformance of AlphaEvolve specifically to Progressive MCGS cross-branch edges, Retrospective Memory, and adaptive coding modes. However, only the full system versus external baselines is reported; no ablation that removes or disables any one of these three components while keeping the remainder fixed is described. Without such controls it is not possible to rule out that the observed gains arise from the LLM backbone, total token budget, or prompt engineering rather than the cited algorithmic additions.
minor comments (1)
  1. [Abstract] Abstract: the performance claims are stated without accompanying statistical details, run counts, variance, or baseline definitions, which makes them hard for readers to assess at a glance.
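As a sketch of the kind of reporting this comment asks for, the helper below summarizes a metric over repeated runs. The function name is hypothetical, and the input values are placeholders chosen for illustration; they are not results from the paper.

```python
import math
import statistics


def summarize(rates: list[float]) -> dict:
    """Run count, mean, sample standard deviation, and standard error of the mean."""
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(rates)
    std = statistics.stdev(rates)  # sample standard deviation (n - 1 denominator)
    return {"runs": n, "mean": mean, "std": std, "sem": std / math.sqrt(n)}


# Placeholder medal rates from three imaginary seeds, NOT taken from the paper.
summary = summarize([0.50, 0.54, 0.52])
print(summary)
```

Reporting the run count and spread alongside the mean lets a reader judge whether a gap between two systems exceeds seed-to-seed noise.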

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments. We address the concern about the lack of ablation studies below and will revise the manuscript accordingly.

  1. Referee: [Evaluation] Evaluation section: the central claim attributes the reported SOTA medal rate, valid submission rate, and outperformance of AlphaEvolve specifically to Progressive MCGS cross-branch edges, Retrospective Memory, and adaptive coding modes. However, only the full system versus external baselines is reported; no ablation that removes or disables any one of these three components while keeping the remainder fixed is described. Without such controls it is not possible to rule out that the observed gains arise from the LLM backbone, total token budget, or prompt engineering rather than the cited algorithmic additions.

    Authors: We agree that the current evaluation reports only full-system results against external baselines and does not include internal ablations that isolate Progressive MCGS, Retrospective Memory, or adaptive coding modes. This limits the strength of causal attribution. In the revised manuscript we will add ablation experiments that disable each component individually while holding the LLM backbone, token budget, and other elements fixed, and we will report the resulting changes in medal rate and valid submission rate on MLE-Bench.

    revision: yes

Circularity Check

0 steps flagged

No circularity; framework is descriptive with no derivation chain

The manuscript is a systems description of an LLM agent framework. It contains no equations, no fitted parameters, no 'predictions' of quantities derived from inputs, and no uniqueness theorems that rest on self-citation. All performance claims are empirical comparisons on MLE-Bench; the three named mechanisms are presented as design choices whose contribution is asserted by the full-system results rather than by any algebraic reduction to the inputs. This matches the default case of a self-contained empirical paper, which receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no equations, parameters, or background assumptions, so the ledger is empty.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agents-K1: Towards Agent-native Knowledge Orchestration

    cs.AI · 2026-06 · unverdicted · novelty 5.0

    Agents-K1 builds agent-native scientific knowledge graphs from full papers via a multimodal parser, 4B GRPO-trained extractor, and tri-source graph interface, applied to 2.46M papers yielding Scholar-KG.
