pith. machine review for the scientific record.

arxiv: 2508.07407 · v2 · submitted 2025-08-10 · 💻 cs.AI · cs.CL · cs.MA

Recognition: 2 theorem links

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:17 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords self-evolving agents · AI agents · foundation models · lifelong agentic systems · evolution techniques · feedback loops · agent adaptability · safety and ethics

The pith

Self-evolving AI agents use interaction feedback to continuously improve beyond their initial foundation model capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines how AI agents can evolve automatically to handle dynamic environments, unlike static systems fixed after deployment. The authors present a four-component framework—System Inputs, Agent System, Environment, and Optimisers—to organize the various evolution techniques. They review methods that enhance each part of the system, including specialized approaches in biomedicine, programming, and finance. The work also addresses evaluation methods and safety concerns essential for reliable lifelong operation. Such a unified view helps researchers build agents that adapt over time rather than remaining limited by their starting configuration.

Core claim

The paper establishes that self-evolving agentic systems can be understood through a feedback loop abstracted into four key components: System Inputs, Agent System, Environment, and Optimisers. This framework enables systematic review of techniques that target different components for automatic enhancement based on interaction data and environmental feedback, along with domain-specific strategies and discussions on evaluation, safety, and ethics.

What carries the argument

A unified conceptual framework that abstracts self-evolving agentic systems into System Inputs, Agent System, Environment, and Optimisers, serving as the basis for classifying and comparing evolution strategies.
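The abstraction can be made concrete with a minimal simulation of the feedback loop. All class names, the toy reward, and the prompt-appending "optimisation" below are this review's illustration of the four components, not code or an algorithm from the paper:

```python
# Minimal sketch of the survey's four-component feedback loop.
# Names and the toy reward scheme are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class AgentSystem:
    """The evolvable part: e.g. a prompt, toolset, or workflow."""
    prompt: str = "v0"

    def act(self, task):
        # Stand-in for an LLM call: behaviour quality tracks prompt size.
        return len(self.prompt)

@dataclass
class Environment:
    """Scores behaviour and returns feedback to the loop."""
    def feedback(self, action):
        return action  # higher is better in this toy setting

@dataclass
class Optimiser:
    """Consumes feedback and rewrites the agent system."""
    def update(self, agent, reward):
        if reward < 10:
            agent.prompt += "+"  # stand-in for automated prompt rewriting

def evolve(system_inputs, agent, env, opt, rounds=5):
    """System Inputs -> Agent System -> Environment -> Optimiser, repeated."""
    history = []
    for task in system_inputs * rounds:
        action = agent.act(task)
        reward = env.feedback(action)
        opt.update(agent, reward)  # closes the loop
        history.append(reward)
    return history

rewards = evolve(["task"], AgentSystem(), Environment(), Optimiser())
```

In this sketch the reward rises monotonically because the Optimiser edits the Agent System between rounds; a static system would return the same reward every round, which is exactly the contrast the survey draws with manually crafted, fixed configurations.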

If this is right

  • Techniques can be developed to optimize specific components like the Agent System or Optimisers independently.
  • Domain constraints in fields such as finance or biomedicine can guide the choice of evolution objectives.
  • Continuous evaluation is necessary to track improvements in agent adaptability over time.
  • Safety protocols must evolve alongside the agents to prevent unintended behaviors in changing environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Integrating this framework with existing benchmarks could reveal gaps in current agent evolution methods.
  • Multi-agent systems might benefit from collective evolution strategies across the four components.
  • Long-term deployment tests in real-world settings would validate the framework's practical utility beyond theoretical classification.

Load-bearing premise

Existing techniques for agent evolution share enough common structure to be unified and compared within a single framework defined by System Inputs, Agent System, Environment, and Optimisers.
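The premise can be tested mechanically: if techniques share enough structure, each family should admit a primary-component tag, and grouping by that tag should partition the literature along the framework's axes. The mapping below is this review's illustration, not the paper's taxonomy:

```python
# Toy classification of evolution-technique families by the framework
# component they primarily evolve. Assignments are the editor's reading,
# not the paper's own taxonomy.
TECHNIQUES = {
    "curriculum / task generation": "System Inputs",
    "prompt optimisation":          "Agent System",
    "memory editing":               "Agent System",
    "simulator co-evolution":       "Environment",
    "reward-model refinement":      "Optimisers",
}

def group_by_component(techniques):
    """Partition technique families by their primary component target."""
    groups = {}
    for name, component in techniques.items():
        groups.setdefault(component, []).append(name)
    return groups

groups = group_by_component(TECHNIQUES)
```

A method that resists any such single-tag assignment (the referee's multi-component worry) is precisely the case where the premise, and with it the unification claim, comes under strain.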

What would settle it

Discovery of an agent evolution method that fundamentally cannot be described using any combination of the four components, or empirical results showing no performance gain from evolution in dynamic tasks, would undermine the proposed unification.

read the original abstract

Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript surveys self-evolving AI agents as a paradigm bridging static foundation models with lifelong agentic systems. It introduces a unified conceptual framework abstracting the underlying feedback loop into four components—System Inputs, Agent System, Environment, and Optimisers—then systematically reviews evolution techniques targeting these components, domain-specific strategies in biomedicine, programming, and finance, and considerations for evaluation, safety, and ethics.

Significance. If the four-component framework enables non-trivial comparisons and unification of techniques, the survey could organize an emerging area and help identify gaps between foundation-model capabilities and continuous adaptation. The domain-specific sections and dedicated safety/ethics discussion add practical value for researchers building lifelong agents.

major comments (2)
  1. [unified conceptual framework] The unified conceptual framework (described after the abstract) defines its four components at a level that can subsume nearly any agent loop, yet provides no explicit criteria, metrics, or decision procedure for assigning techniques to components. This is load-bearing for the claim that the framework serves as a foundation for understanding and comparing strategies, because techniques targeting multiple components simultaneously cannot be distinguished on evolutionary properties.
  2. [systematic review of techniques] In the systematic review of techniques targeting different components, the classification does not specify how overlaps or multi-component evolution methods are handled or compared; without such detail the unification remains loose grouping rather than enabling the meaningful cross-technique analysis asserted in the abstract.

minor comments (1)
  1. [abstract] The abstract states the survey is 'comprehensive' but does not indicate the number of papers reviewed, search methodology, or temporal scope; adding these would strengthen the claim of systematic coverage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your detailed review and valuable feedback on our manuscript. We appreciate the recognition of the potential value of the four-component framework and the domain-specific sections. We have carefully considered the major comments and provide point-by-point responses below. We plan to incorporate clarifications to enhance the rigor of the framework presentation.

read point-by-point responses
  1. Referee: [unified conceptual framework] The unified conceptual framework (described after the abstract) defines its four components at a level that can subsume nearly any agent loop, yet provides no explicit criteria, metrics, or decision procedure for assigning techniques to components. This is load-bearing for the claim that the framework serves as a foundation for understanding and comparing strategies, because techniques targeting multiple components simultaneously cannot be distinguished on evolutionary properties.

    Authors: We agree that providing explicit criteria would improve the framework's ability to support comparisons. In the manuscript, component assignments are determined by identifying the primary target of evolution (e.g., modifications to the Agent System versus changes in Optimisers), with multi-component techniques discussed in context. To strengthen this, we will add explicit guidelines in the framework description section, including a decision procedure based on the main feedback loop element affected and examples of boundary cases. This revision will enable clearer distinction for overlapping methods. revision: partial

  2. Referee: [systematic review of techniques] In the systematic review of techniques targeting different components, the classification does not specify how overlaps or multi-component evolution methods are handled or compared; without such detail the unification remains loose grouping rather than enabling the meaningful cross-technique analysis asserted in the abstract.

    Authors: The review classifies techniques according to their primary component target as per the framework, while noting overlaps explicitly where methods influence multiple components (for instance, in sections covering joint evolution strategies). Cross-technique analysis is enabled by comparing their impacts on the overall system loop. We will revise the systematic review section to include a dedicated explanation of the classification methodology for overlaps, such as categorization rules and how comparisons are drawn across groups. This will make the unification more precise and support the asserted analysis. revision: partial

Circularity Check

0 steps flagged

No circularity: survey taxonomy introduces no self-referential derivations

full rationale

The paper is a literature review that proposes a high-level four-component conceptual framework (System Inputs, Agent System, Environment, Optimisers) solely to classify and compare existing external techniques. No equations, fitted parameters, predictions, or derivations appear anywhere in the manuscript. The framework is introduced as an organizing abstraction rather than derived from or reducing to any self-defined quantities within the paper. All reviewed methods and domain strategies are cited from independent prior work, with no load-bearing self-citation chains or uniqueness theorems invoked. The bridging claim between foundation models and lifelong systems is therefore supported by the external literature surveyed, not by any internal construction that collapses to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work introduces no new free parameters, axioms, or invented entities; it aggregates and taxonomizes prior research under an organizing framework.

pith-pipeline@v0.9.0 · 5641 in / 1109 out tokens · 41199 ms · 2026-05-15T23:17:51.188719+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  3. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  4. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

    cs.AI 2026-04 unverdicted novelty 7.0

    OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

  5. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  6. Position: Assistive Agents Need Accessibility Alignment

    cs.AI 2026-05 conditional novelty 6.0

    Assistive agents for BVI users need accessibility alignment as a core design goal, with a proposed lifecycle pipeline, because sighted assumptions cause unfixable failures in verification, risk, and interaction.

  7. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  8. Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

  9. Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

    cs.AI 2026-05 unverdicted novelty 6.0

    Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...

  10. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  11. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  12. Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.

  13. AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoSurrogate is a multi-agent LLM framework that autonomously constructs, tunes, and validates deep learning surrogates for subsurface flow from natural language, outperforming expert baselines on a 3D carbon storage task.

  14. ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

    cs.CR 2026-04 unverdicted novelty 6.0

    ADAM extracts data from LLM agent memory with up to 100% attack success rate by estimating data distribution and selecting queries via entropy guidance.

  15. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  16. Reflective Context Learning: Studying the Optimization Primitives of Context Space

    cs.LG 2026-04 unverdicted novelty 6.0

    Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...

  17. Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

    cs.CR 2026-03 unverdicted novelty 6.0

    The survey organizes over 400 papers on embodied AI safety into a multi-level taxonomy and flags overlooked issues such as fragile multimodal fusion and unstable planning under jailbreaks.

  18. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

    cs.AI 2026-05 conditional novelty 5.0

    The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

  19. Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    A Lingua Franca reactor-based method is proposed to address nondeterminism in agentic AI for human-in-the-loop cyber-physical systems such as driving coaches.

Reference graph

Works this paper leans on

126 extracted references · 126 canonical work pages · cited by 18 Pith papers · 22 internal anchors


  49. [49]

    MMedAgent: Learning to use medical tools with multi-modal agent.arXiv preprint arXiv:2407.02483, 2024a

    Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. MMedAgent: Learning to use medical tools with multi-modal agent.arXiv preprint arXiv:2407.02483, 2024a. Boyi Li, Zhonghan Zhao, Der-Horng Lee, and Gaoang Wang. Adaptive graph pruning for multi-agent communication. arXiv preprint arXiv...

  50. [50]

    WebThinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025g

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability.arXiv preprint arXiv:2504.21776, 2025g. Xin Sky Li, Qizhi Chu, Yubin Chen, Yang Liu, Yaoqi Liu, Zekai Yu, Weize Chen, Chen Qian, Chuan Shi, and Cheng Yang. GraphTeam: Facilit...

  51. [51]

    Junwei Liao, Muning Wen, Jun Wang, and Weinan Zhang.MARFT: Multi-agent reinforcement fine-tuning.arXiv preprint arXiv:2504.16129,

  52. [52]

    Prompt optimization with human feedback.arXiv preprint arXiv:2405.17346, 2024a

    Xiaoqiang Lin, Zhongxiang Dai, Arun Verma, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Prompt optimization with human feedback.arXiv preprint arXiv:2405.17346, 2024a. Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Use your INSTINCT: instruction optimization for llms ...

  53. [53]

    Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems.arXiv preprint arXiv:2504.01990, 2025a

    Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems.arXiv preprint arXiv:2504.01990, 2025a. Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi,...

  54. [54]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. TheAI scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024a. Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu.MemoChat: Tuning LLMs to use memos for consistent long-range open-do...

  55. [55]

    arXiv preprint arXiv:2410.15048, 2024b

    Siyuan Lu, Jiaqi Shao, Bing Luo, and Tao Lin.MorphAgent: Empowering agents through self-evolving profiles and decentralized collaboration. arXiv preprint arXiv:2410.15048, 2024b. Yao Lu, Jiayi Wang, Raphael Tang, Sebastian Riedel, and Pontus Stenetorp. Strings from the library of babel: Random sampling as a strong baseline for prompt optimisation. InProce...

  56. [56]

    Agentic neural networks: Self-evolving multi-agent systems via textual backpropagation.arXiv preprint arXiv:2506.09046,

    Xiaowen Ma, Chenyang Lin, Yao Zhang, Volker Tresp, and Yunpu Ma. Agentic neural networks: Self-evolving multi-agent systems via textual backpropagation.arXiv preprint arXiv:2506.09046,

  57. [57]

    SciAgent: Tool-augmented language models for scientific reasoning

    Yubo Ma, Zhibin Gou, Junheng Hao, Ruochen Xu, Shuohang Wang, Liangming Pan, Yujiu Yang, Yixin Cao, and Aixin Sun. SciAgent: Tool-augmented language models for scientific reasoning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024, pages 15701–15736,

  58. [58]

    Agora Protocol (AGORA).https://agoraprotocol.org/

    Samuele Marro and Agora Protocol Contributors. Agora Protocol (AGORA).https://agoraprotocol.org/. MIT License, accessed 2025-07-31. 43 Andrew D McNaughton, Gautham Krishna Sankar Ramalaxmi, Agustin Kruel, Carter R Knutson, Rohith A Varikoti, and Neeraj Kumar. Cactus: Chemistry agent connecting tool usage to science.ACS omega, 9(46):46563–46573,

  59. [59]

    Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems

    Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413,

  60. [60]

    Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze.RET-LLM: Towards a general read-write memory for large language models.arXiv preprint arXiv:2305.14322,

  61. [61]

    arXiv preprint arXiv:2412.01928,

    Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip HS Torr, Fabio Pizzati, Ronald Clark, and Christian Schroeder de Witt.MALT: Improving reasoning with multi-agent llm training. arXiv preprint arXiv:2412.01928,

  62. [62]

    https://github.com/ browser-use/browser-use. Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, et al.MLGym: A new framework and benchmark for advancing ai research agents.arXiv preprint arXiv:2502.14499,

  63. [63]

    Alexander Novikov, Ngân V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al.AlphaEvolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131,

  64. [64]

    Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab

    Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366,

  65. [65]

    Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez.MemGPT: Towards llms as operating systems.arXiv preprint arXiv:2310.08560,

  66. [66]

    Gonzalez, Matei Zaharia, and Ion Stoica

    Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. Why do Multi-Agent systems fail? InICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025a. Rui Pan, Shuo Xing, Shizhe Diao, We...

  67. [67]

    arXiv preprint arXiv:2411.06736,

    Junyeong Park, Junmo Cho, and Sungjin Ahn.MrSteve: Instruction-following agents in minecraft with what-where-when memory. arXiv preprint arXiv:2411.06736,

  68. [68]

    Model Context Protocol (MCP)

    Anthropic PBC and Model Context Protocol Contributors. Model Context Protocol (MCP). https:// modelcontextprotocol.io/overview. MIT License, accessed 2025-07-31. Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych.AdapterFusion: Non- destructive task composition for transfer learning. InProceedings of the 16th Conference of...

  69. [69]

    Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, et al.APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay.arXiv preprint arXiv:2504.03601,

  70. [70]

    gradient descent

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7957–7968,

  71. [71]

    arXiv preprint arXiv:2505.15047,

    Yingming Pu, Tao Lin, and Hongyu Chen.PiFlow: Principle-aware scientific discovery with multi-agent collaboration. arXiv preprint arXiv:2505.15047,

  72. [72]

    Agent Q: Advanced reasoning and learning for autonomous ai agents.arXiv preprint arXiv:2408.07199,

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: Advanced reasoning and learning for autonomous ai agents.arXiv preprint arXiv:2408.07199,

  73. [73]

    In2023 Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939

    Cheng Qian, Chi Han, Yi R Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji.CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models. In2023 Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6922–6939. Association for Computational Linguistics (ACL),

  74. [74]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958, 2025a. Yiyue Qian, Shinan Zhang, Yun Zhou, Haibo Ding, Diego Socolinsky, and Yi Zhang. EnhancingLLM-as-a-Judge via multi-agent collaboration.amazon.science, 2025b. 45 Shuofei Q...

  75. [75]

    Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution

    Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, et al. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution. arXiv preprint arXiv:2505.20286,

  76. [76]

    Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, and Dimitrios S Nikolopoulos.MARCO: Multi-agent code optimization with real-time knowledge integration for high-performance computing.arXiv preprint arXiv:2505.03906,

  77. [77]

    CodePori: Large-scale system for autonomous software development using multi-agent technology

    Zeeshan Rasheed, Malik Abdul Sami, Kai-Kristian Kemell, Muhammad Waseem, Mika Saari, Kari Systä, and Pekka Abrahamsson. CodePori: Large-scale system for autonomous software development using multi-agent technology. arXiv preprint arXiv:2402.01411,

  78. [78]

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al.AndroidWorld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573,

  79. [79]

    Shaina Raza, Ranjan Sapkota, Manoj Karkee, and Christos Emmanouilidis.TRiSM for agentic ai: A review of trust, risk, and security management in llm-based agentic multi-agent systems.arXiv preprint arXiv:2506.04133,

  80. [80]

    Liveideabench: Evaluating llms’ divergent thinking for scientific idea generation with minimal context.arXiv preprint arXiv:2412.17596,

    Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, and Hao Sun. Liveideabench: Evaluating llms’ divergent thinking for scientific idea generation with minimal context.arXiv preprint arXiv:2412.17596,

Showing first 80 references.