pith. machine review for the scientific record.

arxiv: 2604.14655 · v2 · submitted 2026-04-16 · 💻 cs.AI · cs.LG

Recognition: no theorem link

AgentGA: Evolving Code Solutions in Agent-Seed Space

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:08 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords agent seed optimization · genetic algorithm · autonomous code generation · parent archive inheritance · tabular AutoML · Kaggle benchmark · long-horizon agents · elite tournament selection

The pith

Optimizing the agent seed—starting prompt plus inherited parent archives—lets a genetic algorithm evolve superior autonomous code-generation runs for tabular AutoML.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether searching over reusable starting conditions for agent runs, rather than editing generated code, improves outcomes on real benchmarks. AgentGA runs a population-level genetic algorithm that launches fresh agent executions each generation while passing selected archives forward so descendants can inspect and reuse prior artifacts. On the 16-competition Weco-Kaggle Lite suite this yields an average 71.90 percent exceed-human rate and wins 15 competitions outright, with archive-conditioned descendants beating de-novo proposals in 51.9 percent of direct tournaments. The results indicate that treating the initial seed as the evolvable unit is a workable design choice when agents already perform long-horizon autonomous search.

Core claim

AgentGA couples a genetic algorithm with long-horizon agents by evolving the agent seed—the task prompt together with optional parent archives that initialize each fresh workspace—rather than mutating code directly. Selection proceeds via deterministic 1:1 elite tournaments, operator allocation adapts online via a modified Hedge controller, and each generation launches an isolated autonomous run. On the full 16-competition benchmark the method records a 71.90 percent average exceed-human rate against 51.38 percent for the AIDE reference and wins 15 of 16 tasks; within runs, archive-inheriting descendants prevail in 51.9 percent of 1,680 parent-child tournaments while de novo proposals win only 8.6 percent.
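To make the outer loop concrete, here is a minimal Python sketch of the evolution described above. Every interface is an assumption: the real long-horizon agent, the benchmark harness, the operator set, and the exact form of the paper's modified Hedge controller are not published here, so `Seed`, `run_agent`, and the textbook Hedge update are illustrative stand-ins only.

```python
import math
import random
from dataclasses import dataclass, field

# Minimal sketch of the outer loop, under assumed interfaces; not the paper's code.

@dataclass
class Seed:
    prompt: str                                                # task prompt
    parent_archives: list[str] = field(default_factory=list)   # inherited artifact paths

@dataclass
class Individual:
    seed: Seed
    score: float = 0.0                                         # e.g. exceed-human rate on the task
    archive: str = ""                                          # artifacts this run leaves behind

def run_agent(seed: Seed) -> tuple[float, str]:
    """Placeholder for one fresh autonomous code-generation run in an isolated workspace."""
    return random.random(), f"archive<{seed.prompt[:24]}>"

OPERATORS = ["mutate_prompt", "crossover", "de_novo"]

def hedge_weights(losses: list[float], eta: float = 0.5) -> list[float]:
    """Textbook Hedge (multiplicative weights) over operators; the paper uses a modified variant."""
    w = [math.exp(-eta * l) for l in losses]
    total = sum(w)
    return [x / total for x in w]

def make_child(op: str, parent: Individual, population: list["Individual"], task: str) -> Seed:
    if op == "de_novo":
        return Seed(prompt=task)                               # fresh proposal, no inheritance
    if op == "mutate_prompt":
        return Seed(prompt=parent.seed.prompt + " Refine the best prior pipeline.",
                    parent_archives=[parent.archive])
    other = random.choice(population)                          # crossover: child may inspect two archives
    return Seed(prompt=parent.seed.prompt, parent_archives=[parent.archive, other.archive])

def evolve(task: str, pop_size: int = 6, generations: int = 4) -> Individual:
    population = [Individual(Seed(task)) for _ in range(pop_size)]
    for ind in population:
        ind.score, ind.archive = run_agent(ind.seed)
    losses = [0.0] * len(OPERATORS)
    for _ in range(generations):
        weights = hedge_weights(losses)
        for i, parent in enumerate(list(population)):
            op = random.choices(OPERATORS, weights=weights)[0]
            child = Individual(make_child(op, parent, population, task))
            child.score, child.archive = run_agent(child.seed)
            # Deterministic 1:1 elite tournament: the child replaces its parent only if it scores higher.
            if child.score > parent.score:
                population[i] = child
                losses[OPERATORS.index(op)] -= 0.1             # operator produced a tournament winner
            else:
                losses[OPERATORS.index(op)] += 0.1             # operator lost; Hedge down-weights it
    return max(population, key=lambda ind: ind.score)

if __name__ == "__main__":
    best = evolve("Solve the tabular-playground-series-aug-2022 competition.")
    print(round(best.score, 3), best.seed.parent_archives)
```

The point of the sketch is the division of labor: the inner `run_agent` call does all code generation autonomously, while the outer loop only decides which prompts and which inherited archives each fresh run starts from.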

What carries the argument

The agent seed: the task prompt plus optional parent archives that initialize a fresh workspace, which serves as the evolvable unit allowing inheritance of artifacts across generations without direct code editing.
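As one illustration of how such a seed could initialize a run, the following sketch stages inherited archives inside a fresh isolated workspace. The directory layout, file names, and the `init_workspace` helper are assumptions for illustration, not the paper's implementation.

```python
import shutil
import tempfile
from pathlib import Path

# Illustrative only: the paper does not specify its workspace layout or agent entry point.

def init_workspace(task_prompt: str, parent_archives: list[Path]) -> Path:
    """Create a fresh isolated workspace and stage inherited artifacts for the descendant to inspect."""
    workspace = Path(tempfile.mkdtemp(prefix="agentga_run_"))
    (workspace / "PROMPT.md").write_text(task_prompt)          # the seed's task prompt
    inherited = workspace / "inherited"
    inherited.mkdir()
    for i, archive in enumerate(parent_archives):
        if archive.is_dir():                                   # prior run's code, logs, and models
            shutil.copytree(archive, inherited / f"parent_{i}")
    return workspace
```

The descendant's run then starts from the prompt plus the staged artifacts, with no direct edits to the parent's code.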

If this is right

  • Descendants that inherit parent archives outperform fresh proposals in the majority of direct comparisons.
  • A population-level genetic algorithm with elite tournaments and online operator allocation can be applied to long-horizon autonomous agents.
  • Agent-seed optimization constitutes a practical design choice for autonomous code-search systems on tabular AutoML tasks.
  • The approach wins 15 of 16 competitions on the Weco-Kaggle Lite benchmark at an average 71.90 percent exceed-human rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same seed-evolution pattern could be tested on domains other than tabular data, such as image or text generation pipelines, to check whether workspace inheritance transfers.
  • If archive reuse proves robust, future systems might reduce population size by focusing compute on promising inherited lineages rather than broad random restarts.
  • Separating prompt evolution from workspace-state evolution would clarify which component drives most of the observed tournament advantage.
  • The method suggests that agent frameworks could benefit from explicit mechanisms to log and replay successful initialization states across independent runs.

Load-bearing premise

The reported performance gains arise specifically from optimizing and inheriting agent seeds rather than from unstated details of the underlying agent implementation, benchmark tuning, or selective reporting of runs.

What would settle it

A controlled experiment that keeps the same agent implementation and total compute budget but disables seed optimization and inheritance, then measures whether the exceed-human rate falls back to the 51.38 percent reference level.
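One way to read that experiment as code: a self-contained sketch in which both arms spend the same number of fresh agent runs per competition, with seed optimization and inheritance switched off in the control arm. The greedy treatment arm, the 24-run budget, and the metric placeholder are assumptions standing in for the paper's full genetic algorithm and benchmark harness.

```python
import random
import statistics

# Illustrative ablation harness; competitions, budget, and the metric are placeholders.
COMPETITIONS = ["tabular-playground-series-aug-2022"]         # stand-in for the 16-task suite
TOTAL_RUN_BUDGET = 24                                         # fresh agent runs allowed per competition

def fresh_agent_run(task: str, inherited_archives: list[str]) -> tuple[float, str]:
    """Placeholder for one long-horizon autonomous run; returns (score, archive identifier)."""
    return random.random(), f"archive<{task[:16]}|{len(inherited_archives)} parents>"

def condition(task: str, seed_evolution: bool) -> float:
    """Best score under a fixed run budget, with seed optimization on or off."""
    if not seed_evolution:
        # Control arm: independent de novo runs, best-of-N, no inheritance.
        return max(fresh_agent_run(task, [])[0] for _ in range(TOTAL_RUN_BUDGET))
    # Treatment arm: greedy stand-in for the outer GA; each run inherits the best archive so far.
    best_score, best_archive = fresh_agent_run(task, [])
    for _ in range(TOTAL_RUN_BUDGET - 1):
        score, archive = fresh_agent_run(task, [best_archive])
        if score > best_score:                                 # deterministic 1:1 parent-child tournament
            best_score, best_archive = score, archive
    return best_score

def exceeds_human_rate(score: float) -> float:
    """Placeholder for the benchmark's 'Exceeds % of Human' metric."""
    return 100.0 * score

for comp in COMPETITIONS:
    task = f"Solve {comp}."
    print(comp,
          round(exceeds_human_rate(condition(task, True)), 1),
          round(exceeds_human_rate(condition(task, False)), 1))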

Figures

Figures reproduced from arXiv: 2604.14655 by David Y.Y. Tan, Jingxian Zhang, Kellie Chin.

Figure 1. Two-layer architecture. Evolution assigns a task type and parent archives to each fresh … (full image at source)
Figure 2. Tournament outcomes by task across these benchmark runs. Each bar shows the share … (full image at source)
Figure 3. Best-score progression over the tabular-playground-series-aug-2022 run. (full image at source)
Figure 4. Parent-child structure for the tabular-playground-series-aug-2022 run, showing explicit … (full image at source)
read the original abstract

We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run in an isolated workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. Across the full benchmark, AgentGA averages 71.90% Exceeds % of Human versus 51.38% for the AIDE reference, winning 15/16 competitions. Within AgentGA runs, descendants conditioned on inherited parent archives win 51.9% of 1,680 parent-child tournaments versus 8.6% for de novo proposals. These results support agent-seed optimization as a practical design choice for autonomous code-search systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents AgentGA, a framework that evolves autonomous code-generation agents by optimizing the agent seed (task prompt plus optional parent archives) via an outer genetic algorithm loop rather than direct code edits. Each generation runs a fresh long-horizon agent in an isolated workspace, with selection via deterministic 1:1 elite tournaments and online Hedge-based operator adaptation. On the 16-competition Weco-Kaggle Lite tabular AutoML benchmark, AgentGA achieves 71.90% average Exceeds % of Human versus 51.38% for the AIDE reference (winning 15/16 competitions); within AgentGA, inherited-archive descendants win 51.9% of 1,680 parent-child tournaments versus 8.6% for de novo proposals.

Significance. If the base-agent equivalence and experimental controls hold, the work provides evidence that optimizing reusable starting conditions can improve performance in long-horizon code-search systems, offering a practical alternative or complement to direct code mutation. The internal tournament design supplies a within-method control that partially isolates the inheritance effect, and the deterministic selection plus adaptive operators are clear methodological strengths.

major comments (3)
  1. [Abstract] The central claim that performance gains arise from agent-seed optimization requires that the underlying long-horizon agent, workspace isolation, and benchmark execution are identical between AgentGA and AIDE except for the outer GA loop and inheritance mechanism. No such confirmation or base-agent description is supplied, so the 71.90% vs 51.38% comparison cannot be attributed specifically to seed evolution.
  2. [Abstract] The headline results (71.90% Exceeds % of Human, 15/16 wins) are stated without any experimental protocol, run-selection criteria, statistical tests, or description of how runs were chosen or excluded. This absence makes the benchmark outcomes impossible to assess for support of the central claim.
  3. [Abstract] The internal 51.9% vs 8.6% tournament statistic is computed only within AgentGA runs and therefore does not address potential confounds in the cross-method comparison to AIDE.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief sentence clarifying the precise definition of 'Exceeds % of Human' and the construction of the Weco-Kaggle Lite benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for clearer experimental controls and protocol details. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that performance gains arise from agent-seed optimization requires that the underlying long-horizon agent, workspace isolation, and benchmark execution are identical between AgentGA and AIDE except for the outer GA loop and inheritance mechanism. No such confirmation or base-agent description is supplied, so the 71.90% vs 51.38% comparison cannot be attributed specifically to seed evolution.

    Authors: We agree that explicit confirmation of base-agent equivalence is required to support attribution of gains to seed optimization. The manuscript positions AgentGA as an outer loop around the same long-horizon autonomous code-generation agent and isolated workspace used by the AIDE reference, with the sole additions being the genetic algorithm over seeds and parent-archive inheritance. To make this equivalence unambiguous, we will add a dedicated subsection in the Methods section that describes the base agent configuration, workspace isolation protocol, and benchmark execution harness, together with an explicit statement that these elements are held fixed between the two methods. revision: yes

  2. Referee: [Abstract] The headline results (71.90% Exceeds % of Human, 15/16 wins) are stated without any experimental protocol, run-selection criteria, statistical tests, or description of how runs were chosen or excluded. This absence makes the benchmark outcomes impossible to assess for support of the central claim.

    Authors: We acknowledge that the current manuscript does not supply a full experimental protocol, run-selection criteria, or statistical analysis in the abstract or main text. The reported figures reflect single deterministic runs per competition for each method. We will insert a new Experimental Setup section that details the run protocol, any exclusion rules, computational budget, and the rationale for omitting formal statistical tests (driven by the high cost of long-horizon agent executions). This addition will allow readers to evaluate the strength of the benchmark comparison directly. revision: yes

  3. Referee: [Abstract] The internal 51.9% vs 8.6% tournament statistic is computed only within AgentGA runs and therefore does not address potential confounds in the cross-method comparison to AIDE.

    Authors: The referee correctly notes that the parent-child tournament results are computed exclusively inside AgentGA runs and therefore cannot serve as a control for confounds between AgentGA and AIDE. This internal statistic is intended only to isolate the contribution of archive inheritance within our own method. The primary cross-method evidence rests on the assumption of identical base agents and benchmarks, which we will make explicit in the revised Methods section as described in our response to the first comment. We will also revise the text to state the limited scope of the tournament analysis. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparisons with no derivations or self-referential chains

full rationale

The paper contains no equations, derivations, fitted parameters, or mathematical claims. All results are direct empirical measurements (win rates, tournament outcomes) on the Weco-Kaggle Lite benchmark against an external reference (AIDE). No self-citations, ansatzes, uniqueness theorems, or renamings of known results appear in the provided text. The central claim is an observed performance difference, not a reduction of any output to its inputs by construction. This is the normal case for an empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities beyond the high-level framing of agent-seed space; lower-level details are absent from the reviewed text.

pith-pipeline@v0.9.0 · 5494 in / 1002 out tokens · 135314 ms · 2026-05-12T04:08:02.870567+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 6 internal anchors

  1. [1]

    SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning

    Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, and Chenglin Wu. SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning, October 2024. URL http://arxiv.org/abs/2410.17238. arXiv:2410.17238 [cs]

  2. [2]

    AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

    Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, March 2020. URL http://arxiv.org/abs/2003.06505. arXiv:2003.06505 [stat]

  3. [3]

    MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

    Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. MLZero: A Multi-Agent System for End-to-end Machine Learning Automation. In Advances in Neural Information Processing Systems 38 (NeurIPS), 2025. doi:10.48550/arX...

  4. [4]

    Efficient and Robust Automated Machine Learning

    Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and Robust Automated Machine Learning . In Advances in Neural Information Processing Systems , volume 28. Curran Associates, Inc., 2015. URL https://papers.nips.cc/paper/2015/hash/11d0e6287202fced83f79975ec59a3a6-Abstract.html

  5. [5]

    Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning

    Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. Journal of Machine Learning Research, 23(261): 1--61, 2022. ISSN 1533-7928. URL http://jmlr.org/papers/v23/21-0992.html

  6. [6]

    A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

    Yoav Freund and Robert E Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1): 119--139, August 1997. ISSN 0022-0000. doi:10.1006/jcss.1997.1504. URL https://www.sciencedirect.com/science/article/pii/S002200009791504X

  7. [7]

    Antoine Grosnit, Alexandre Maraval, Refinath S. N, Zichao Zhao, James Doran, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, Khyati Khandelwal, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balázs Kégl, Haitham Bou-Ammar, and Jun Wang. Kolb- Based Experiential Learning for Generalist Agents wi...

  8. [8]

    EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024a. doi:10.48550/arXiv.2309.08532. URL http://arxiv.org/abs/2...

  9. [9]

    DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

    Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 16813--16848, 2024b. doi:10.48550/arXiv.2402.17453. URL http://arxiv.org/abs/2402.17453

  10. [10]

    Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering

    Noah Hollmann, Samuel Müller, and Frank Hutter. Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023. doi:10.48550/arXiv.2305.03403. URL http://arxiv.org/abs/2305.03403

  11. [11]

    Data Interpreter: An LLM Agent for Data Science

    Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Chenxing Wei, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Xiangru Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zhibin Gou, Zongze Xu, and Chenglin Wu. Data Interprete...

  12. [12]

    Automated Design of Agentic Systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems, 2025. URL https://arxiv.org/abs/2408.08435

  13. [13]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code, February 2025. URL http://arxiv.org/abs/2502.13138. arXiv:2502.13138 [cs]

  14. [14]

    Competition-Level Code Generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  15. [15]

    AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

    Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, and Ge Zhang. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025. ...

  16. [16]

    I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search

    Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, and Xinhui Wu. I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search, February 2025. URL http://arxiv.org/abs/2502.14693. arXiv:2502.14693 [cs]

  17. [17]

    SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

    Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, and Huacan Wang. Se-agent: Self-evolution trajectory optimization in multi-step reasoning with llm-based agents, 2025. URL https://arxiv.org/abs/2508.02085

  18. [18]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve : A coding agent for scientific and algor...

  19. [19]

    Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

    Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO '16, pages 485--492, New York, NY, USA, July 2016. Association for Computing Machinery. ISBN 978-1-4503-4206-3. doi:...

  20. [20]

    Mathematical Discoveries from Program Search with Large Language Models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995): 468--475, January 2024. ISSN 1476-468...

  21. [21]

    AgentSquare: Automatic LLM Agent Search in Modular Design Space

    Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space, 2025. URL https://arxiv.org/abs/2410.06153

  22. [22]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team. Kimi k2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602.02276

  23. [23]

    Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms

    Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '13, pages 847--855, New York, NY, USA, August 2013. Association for Computing Machin...

  24. [24]

    AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

    Patara Trirat, Wonyong Jeong, and Sung Ju Hwang. AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. doi:10.48550/arXiv.2410.02958. URL http://arxiv.org/abs/2410.02958

  25. [25]

    Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing, 2026

    Zhaotian Weng, Antonis Antoniades, Deepak Nathani, Zhen Zhang, Xiao Pu, and Xin Eric Wang. Group-evolving agents: Open-ended self-improvement via experience sharing, 2026. URL https://arxiv.org/abs/2602.04837

  26. [26]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023. URL http://arxiv.org/abs/2210.03629

  27. [27]

    ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution

    Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, and Guojie Song. ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024. doi:10.48550/arXiv.2402.01145. URL http://arxiv.org/abs/2402.01145

  28. [28]

    EvoFlow: Evolving Diverse Agentic Workflows on the Fly

    Guibin Zhang, Kaijie Chen, Guancheng Wan, Heng Chang, Hong Cheng, Kun Wang, Shuyue Hu, and Lei Bai. EvoFlow: Evolving diverse agentic workflows on the fly, 2025a. URL https://arxiv.org/abs/2502.07373

  29. [29]

    Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents, 2026. URL https://arxiv.org/abs/2505.22954

  30. [30]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation, 2025 b . URL https://arxiv.org/abs/2410.10762

  31. [31]

    MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks

    Lei Zhang, Yuge Zhang, Kan Ren, Dongsheng Li, and Yuqing Yang. MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2931--2959, 2024. doi:10.48550/arXiv.2304.14979. URL http://arxiv.org/ab...