pith. machine review for the scientific record.

arxiv: 2504.19678 · v2 · submitted 2025-04-28 · 💻 cs.AI · cs.LG

Recognition: 1 theorem link

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:53 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM benchmarks · AI agent frameworks · evaluation taxonomy · multi-agent collaboration · reasoning tasks · code generation benchmarks · multimodal evaluation · autonomous agents

The pith

A review organizes roughly 60 benchmarks for large language models and autonomous agents into one taxonomy covering reasoning, code, and real-world tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper consolidates scattered benchmarks and frameworks for large language models and AI agents that have appeared since 2019. It lines up side-by-side comparisons and groups the tests into categories such as mathematical problem-solving, code generation, factual retrieval, multimodal tasks, and interactive assessments. The work also surveys agent frameworks from 2023 onward, lists applications in science and industry, and examines protocols that let agents communicate with one another. A reader would care because the growing number of models and tools now requires clearer maps to decide which evaluations actually measure useful progress. The review ends by pointing to open questions around advanced reasoning, failure modes, and security in multi-agent systems.

Core claim

The landscape of evaluation benchmarks and AI-agent frameworks remains fragmented, and a unified taxonomy of approximately sixty benchmarks developed between 2019 and 2025 supplies the missing structure. These benchmarks are grouped under general knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding, domain-specific tests, multimodal and embodied tasks, task orchestration, and interactive assessments. The paper further reviews modular agent frameworks, real-world deployments ranging from materials science to finance, and three agent-to-agent collaboration protocols.

What carries the argument

The proposed taxonomy of approximately 60 benchmarks that groups evaluations into eight domain categories and enables side-by-side comparison across years.
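The eight-category grouping lends itself to a simple lookup structure. A minimal sketch in Python, where the category names follow the paper but the benchmark names under each are illustrative choices made here, not the paper's actual assignments:

```python
# Sketch of the eight-category benchmark taxonomy. Category names are taken
# from the review; the benchmarks listed under each are illustrative examples,
# not the paper's real ~60-item mapping.
TAXONOMY = {
    "general and academic knowledge reasoning": ["MMLU", "BIG-Bench Hard"],
    "mathematical problem-solving": ["GSM8K", "MATH"],
    "code generation and software engineering": ["HumanEval", "SWE-bench"],
    "factual grounding and retrieval": ["TruthfulQA", "Natural Questions"],
    "domain-specific evaluations": ["MedQA", "LegalBench"],
    "multimodal and embodied tasks": ["MMMU", "ALFWorld"],
    "task orchestration": ["AgentBench"],
    "interactive assessments": ["WebArena"],
}

def categories_for(benchmark: str) -> list[str]:
    """Return every category a benchmark appears under, enabling the
    side-by-side comparison the taxonomy is meant to support."""
    return [cat for cat, names in TAXONOMY.items() if benchmark in names]
```

A structure like this makes the taxonomy queryable in both directions: by category (which tests cover code generation?) and by benchmark (where does GSM8K sit?).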

If this is right

  • Researchers gain a single reference to compare model performance across mathematical, code, and embodied domains.
  • Agent frameworks can be evaluated consistently against the same set of orchestration and interaction tests.
  • Real-world applications in healthcare, finance, and materials science receive clearer benchmark baselines.
  • Collaboration protocols such as ACP, MCP, and A2A become easier to test against shared failure-mode criteria.
  • Future work on reinforcement-learning tool integration and automated scientific discovery can build on the identified gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could seed a living public repository that adds new benchmarks automatically as they appear.
  • Security testing for agent protocols may need its own dedicated category to keep pace with deployment risks.
  • Connections between the embodied-task benchmarks and physical robotics experiments remain untested in the current grouping.
  • Dynamic tool use via reinforcement learning could be evaluated by extending the existing task-orchestration category.

Load-bearing premise

The selected benchmarks and frameworks are assumed to cover the main landscape without large omissions or selection bias.

What would settle it

A complete inventory of all benchmarks published between 2019 and 2025 that reveals many were omitted from the taxonomy, or that shows the eight-category grouping misses major overlaps or gaps.

read the original abstract

Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. Driven by the growing need for standardized evaluation and integration, we systematically consolidate these fragmented efforts into a unified framework. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side-by-side comparison of benchmarks developed between 2019 and 2025 that evaluate these models and agents across multiple domains. In addition, we propose a taxonomy of approximately 60 benchmarks that cover general and academic knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments. Furthermore, we review AI-agent frameworks introduced between 2023 and 2025 that integrate large language models with modular toolkits to enable autonomous decision-making and multi-step reasoning. Moreover, we present real-world applications of autonomous AI agents in materials science, biomedical research, academic ideation, software engineering, synthetic data generation, chemical reasoning, mathematical problem-solving, geographic information systems, multimedia, healthcare, and finance. We then survey key agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A). Finally, we discuss recommendations for future research, focusing on advanced reasoning strategies, failure modes in multi-agent LLM systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to systematically consolidate fragmented efforts in LLM reasoning and autonomous AI agents into a unified framework. It provides a side-by-side comparison and proposes a taxonomy of approximately 60 benchmarks (2019–2025) covering general/academic knowledge, mathematical problem-solving, code generation, factual grounding, domain-specific tasks, multimodal/embodied work, task orchestration, and interactive assessments. It further reviews 2023–2025 agent frameworks, real-world applications across materials science, biomedicine, software engineering and other domains, agent-to-agent protocols (ACP, MCP, A2A), and offers future research recommendations on reasoning strategies, failure modes, and security.

Significance. A reproducible, exhaustive taxonomy and comparison of recent benchmarks would be a useful consolidation for a fast-moving field, reducing duplication and aiding standardized evaluation. The review of frameworks, applications, and protocols adds practical value if the coverage is representative rather than selective.

major comments (2)
  1. [Abstract / Taxonomy Proposal] The central claim of a 'unified taxonomy' of ~60 benchmarks (abstract and taxonomy section) rests on an unspecified selection process. No search strategy, databases, keywords, date ranges, or inclusion/exclusion criteria are provided, nor is the number of screened versus included items reported. This directly undermines verifiability of exhaustiveness across the listed domains.
  2. [Benchmark Comparison] The side-by-side benchmark comparison (abstract) cannot be assessed for balance or bias without the methodology details above; domain coverage (e.g., multimodal vs. code generation) may reflect author preference rather than literature distribution, weakening the unification argument.
minor comments (1)
  1. [Review of Frameworks] Clarify the exact time windows used for frameworks (2023–2025) versus benchmarks (2019–2025) and ensure consistent citation of all included works.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological transparency. We agree that explicit details on the literature search process are needed to support claims of exhaustiveness and will revise the manuscript to address this.

read point-by-point responses
  1. Referee: [Abstract / Taxonomy Proposal] The central claim of a 'unified taxonomy' of ~60 benchmarks (abstract and taxonomy section) rests on an unspecified selection process. No search strategy, databases, keywords, date ranges, or inclusion/exclusion criteria are provided, nor is the number of screened versus included items reported. This directly undermines verifiability of exhaustiveness across the listed domains.

    Authors: We acknowledge that the current manuscript does not include an explicit description of the search methodology. The taxonomy was compiled via a broad review of arXiv, Google Scholar, and major NLP venues for papers from 2019–2025 using keywords such as 'LLM benchmark', 'agent evaluation', 'multimodal reasoning', and 'autonomous agent framework'. In the revised version we will add a 'Literature Search Methodology' subsection (with a PRISMA-style flow) reporting databases, exact keywords, date filters, inclusion criteria (empirical evaluations of LLM-based agents or benchmarks), and screened/included counts to enable verification of coverage. revision: yes

  2. Referee: [Benchmark Comparison] The side-by-side benchmark comparison (abstract) cannot be assessed for balance or bias without the methodology details above; domain coverage (e.g., multimodal vs. code generation) may reflect author preference rather than literature distribution, weakening the unification argument.

    Authors: We agree that balance cannot be fully evaluated without the search details. Once the methodology section is added, we will also include a brief discussion of domain prevalence in the retrieved literature and note any areas where coverage is sparser (e.g., certain embodied tasks). The comparison table will be framed as a representative sample drawn from the identified works rather than an exhaustive census, thereby reducing the risk of perceived selection bias. revision: yes
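The PRISMA-style reporting promised in the rebuttal reduces to tallying records at each screening stage so that screened versus included counts can be verified. A minimal sketch, assuming hypothetical record fields and the inclusion criteria the rebuttal names (2019–2025 date window, empirical evaluations of LLM-based agents or benchmarks):

```python
from dataclasses import dataclass

# Hypothetical record schema for a PRISMA-style screening tally; the field
# names and criteria here are illustrative, not from the paper.
@dataclass
class Record:
    title: str
    year: int
    is_empirical: bool  # inclusion criterion: empirical evaluation of LLM agents/benchmarks

def screen(records, year_min=2019, year_max=2025):
    """Apply the date-window and empirical-study filters, returning
    (number screened, number included, included records)."""
    screened = len(records)
    included = [r for r in records
                if year_min <= r.year <= year_max and r.is_empirical]
    return screened, len(included), included
```

Reporting both counts (and the reasons records were excluded) is what lets a reader check the claimed exhaustiveness of the ~60-benchmark taxonomy.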

Circularity Check

0 steps flagged

No circularity in survey taxonomy or framework proposal

full rationale

This is a literature review paper that aggregates and organizes existing benchmarks, frameworks, and protocols from 2019-2025 without any mathematical derivations, fitted parameters, or equations. The proposed taxonomy of ~60 benchmarks is presented as an organizational synthesis of cited prior work rather than a quantity derived from self-referential inputs or self-citations. No load-bearing step reduces to a tautology, self-definition, or author-specific uniqueness theorem; the central claims rest on external citations and descriptive coverage rather than internal construction. The absence of a detailed search protocol is a methodological limitation but does not create circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper is a synthesis of prior published work on LLM agents and benchmarks.

pith-pipeline@v0.9.0 · 5584 in / 1053 out tokens · 40369 ms · 2026-05-15T02:53:18.046467+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tools as Continuous Flow for Evolving Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-...

  2. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  3. Exploiting LLM Agent Supply Chains via Payload-less Skills

    cs.CR 2026-05 conditional novelty 6.0

    Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...

  4. Position: Assistive Agents Need Accessibility Alignment

    cs.AI 2026-05 conditional novelty 6.0

    Assistive agents for BVI users need accessibility alignment as a core design goal, with a proposed lifecycle pipeline, because sighted assumptions cause unfixable failures in verification, risk, and interaction.

  5. STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    STAR combines expert nominal routes with trace-learned recovery transitions in a failure-typed routing matrix, improving multi-agent spatiotemporal reasoning over baselines especially on error-deviating queries.

  6. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  7. LATTICE: Evaluating Decision Support Utility of Crypto Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

  8. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  9. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  10. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  11. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.

  12. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

  13. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  14. AgentComm: Semantic Communication for Embodied Agents

    eess.SP 2026-04 unverdicted novelty 6.0

    AgentComm achieves nearly 50% bandwidth reduction in embodied agent communication via LLM semantic processing, importance-aware transmission, and a task knowledge base, with negligible impact on task completion.

  15. STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    STAR is a failure-aware Markovian router that learns recovery transitions from both successful and unsuccessful execution traces to improve multi-agent performance on spatiotemporal benchmarks.

  16. A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    Researchers developed a fast XGBoost-based detector using 42 runtime features to spot adversarial interaction patterns in LLM agents, running over 9 times faster than LLM detectors on synthetic multi-turn data.

  17. AgentDID: Trustless Identity Authentication for AI Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.

  18. Intention-Aware Semantic Agent Communications for AI Glasses

    eess.SP 2026-04 unverdicted novelty 5.0

    An intention-aware semantic agent system for AI glasses reduces bandwidth by over 50% in simulations while preserving task performance through adaptive preprocessing guided by inferred user intentions.

  19. Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

    cs.AI 2026-04 unverdicted novelty 4.0

    A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.

  20. A Survey of Context Engineering for Large Language Models

    cs.CL 2025-07 accept novelty 4.0

    The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

  21. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    cs.AI 2025-01 unverdicted novelty 4.0

    Agentic RAG embeds agents with reflection, planning, tool use, and collaboration into retrieval pipelines to overcome static RAG limitations, and the survey offers a taxonomy by agent count, control, autonomy, and kno...

  22. A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective

    q-fin.PR 2026-04 unverdicted novelty 3.0

    This review synthesizes LLM uses in stock forecasting and catalogs key practical pitfalls from a hedge-fund viewpoint.

Reference graph

Works this paper leans on

236 extracted references · 236 canonical work pages · cited by 20 Pith papers · 34 internal anchors

  1. [1]

    OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney et al., “OpenAI o1 system card,” arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-Omni technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2503.20215

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

  4. [4]

    The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  5. [5]

    Understanding the planning of LLM agents: A survey

X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang, R. Tang, and E. Chen, “Understanding the planning of LLM agents: A survey,” arXiv preprint arXiv:2402.02716, 2024

  6. [6]

    A Survey on LLM-as-a-Judge

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu et al., “A survey on LLM-as-a-judge,” arXiv preprint arXiv:2411.15594, 2024

  7. [7]

    I know which llm wrote your code last summer: Llm generated code stylometry for authorship attribution,

T. Bisztray, B. Cherif, R. A. Dubniczky, N. Gruschka, B. Borsos, M. A. Ferrag, A. Kovacs, V. Mavroeidis, and N. Tihanyi, “I know which LLM wrote your code last summer: LLM generated code stylometry for authorship attribution,” arXiv preprint arXiv:2506.17323, 2025

  8. [8]

    Vidorag: Visual document retrieval-augmented generation via dynamic iterative reasoning agents,

    Q. Wang, R. Ding, Z. Chen, W. Wu, S. Wang, P. Xie, and F. Zhao, “Vidorag: Visual document retrieval-augmented generation via dynamic iterative reasoning agents,”arXiv preprint arXiv:2502.18017, 2025

  9. [9]

    Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent,

Y. Li, Y. Li, X. Wang, Y. Jiang, Z. Zhang, X. Zheng, H. Wang, H.-T. Zheng, P. Xie, P. S. Yu et al., “Benchmarking multimodal retrieval augmented generation with dynamic VQA dataset and self-adaptive planning agent,” arXiv preprint arXiv:2411.02937, 2024

  10. [10]

    Rag-kg-il: A multi-agent hybrid framework for reducing hallucinations and enhancing llm reasoning through rag and incremental knowledge graph learning integration,

    H. Q. Yu and F. McQuade, “Rag-kg-il: A multi-agent hybrid framework for reducing hallucinations and enhancing llm reasoning through rag and incremental knowledge graph learning integration,”arXiv preprint arXiv:2503.13514, 2025

  11. [11]

BioRAGent: A retrieval-augmented generation system for showcasing generative query expansion and domain-specific search for scientific Q&A

    S. Ateia and U. Kruschwitz, “BioRAGent: A retrieval-augmented generation system for showcasing generative query expansion and domain-specific search for scientific Q&A,” arXiv preprint arXiv:2412.12358, 2024

  12. [12]

    Retrieval-augmented simulacra: Generative agents for up-to-date and knowledge-adaptive simulations,

    H. Shimadzu, T. Utsuro, and D. Kitayama, “Retrieval-augmented simulacra: Generative agents for up-to-date and knowledge-adaptive simulations,”arXiv preprint arXiv:2503.14620, 2025

  13. [13]

    Rag-gym: Optimizing reasoning and search agents with process supervision,

G. Xiong, Q. Jin, X. Wang, Y. Fang, H. Liu, Y. Yang, F. Chen, Z. Song, D. Wang, M. Zhang et al., “RAG-Gym: Optimizing reasoning and search agents with process supervision,” arXiv preprint arXiv:2502.13957, 2025

  14. [14]

    Reasoning beyond limits: Advances and open problems for llms,

    M. A. Ferrag, N. Tihanyi, and M. Debbah, “Reasoning beyond limits: Advances and open problems for llms,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22732

  15. [15]

    GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  16. [16]

    Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  17. [17]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  18. [18]

Agent Laboratory: Using LLM Agents as Research Assistants

    S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, Z. Liu, and E. Barsoum, “Agent laboratory: Using LLM agents as research assistants,” arXiv preprint arXiv:2501.04227, 2025

  19. [19]

    Litsearch: A retrieval benchmark for scientific literature search,

    A. Ajith, M. Xia, A. Chevalier, T. Goyal, D. Chen, and T. Gao, “Litsearch: A retrieval benchmark for scientific literature search,”arXiv preprint arXiv:2407.18940, 2024

  20. [20]

    Researcharena: Benchmarking llms’ ability to collect and organize information as research agents,

    H. Kang and C. Xiong, “Researcharena: Benchmarking llms’ ability to collect and organize information as research agents,”arXiv preprint arXiv:2406.10291, 2024

  21. [21]

    Researchagent: Iterative research idea generation over scientific literature with large language models,

    J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang, “Researchagent: Iterative research idea generation over scientific literature with large language models,”arXiv preprint arXiv:2404.07738, 2024

  22. [22]

    Agentic ai for scientific discovery: A survey of progress, challenges, and future directions,

    M. Gridach, J. Nanavati, K. Z. E. Abidine, L. Mendes, and C. Mack, “Agentic ai for scientific discovery: A survey of progress, challenges, and future directions,”arXiv preprint arXiv:2503.08979, 2025

  23. [23]

    Mdagents: An adaptive collaboration of llms for medical decision-making,

Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, H. Park et al., “MDAgents: An adaptive collaboration of LLMs for medical decision-making,” Advances in Neural Information Processing Systems, vol. 37, pp. 79410–79452, 2024

  24. [24]

    Agent-flan: Designing data and methods of effective agent tuning for large language models,

    Z. Chen, K. Liu, Q. Wang, W. Zhang, J. Liu, D. Lin, K. Chen, and F. Zhao, “Agent-flan: Designing data and methods of effective agent tuning for large language models,”arXiv preprint arXiv:2403.12881, 2024

  25. [25]

Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents

    J. Li, Y. Lai, W. Li, J. Ren, M. Zhang, X. Kang, S. Wang, P. Li, Y.-Q. Zhang, W. Ma et al., “Agent hospital: A simulacrum of hospital with evolvable medical agents,” arXiv preprint arXiv:2405.02957, 2024

  26. [26]

    Agent ai: Surveying the horizons of multimodal interaction

Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y. Noda, D. Terzopoulos, Y. Choi et al., “Agent AI: Surveying the horizons of multimodal interaction,” arXiv preprint arXiv:2401.03568, 2024

  27. [27]

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

    Y. Zhu, S. Qiao, Y. Ou, S. Deng, S. Lyu, Y. Shen, L. Liang, J. Gu, H. Chen, and N. Zhang, “KnowAgent: Knowledge-augmented planning for LLM-based agents,” arXiv preprint arXiv:2403.03101, 2024

  28. [28]

    Webvoyager: Building an end-to-end web agent with large multimodal models,

H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu, “WebVoyager: Building an end-to-end web agent with large multimodal models,” arXiv preprint arXiv:2401.13919, 2024

  29. [29]

    Polaris: A safety-focused llm constellation architecture for healthcare,

S. Mukherjee, P. Gamble, M. S. Ausin, N. Kant, K. Aggarwal, N. Manjunath, D. Datta, Z. Liu, J. Ding, S. Busacca et al., “Polaris: A safety-focused LLM constellation architecture for healthcare,” arXiv preprint arXiv:2403.13313, 2024

  30. [30]

    R-judge: Benchmarking safety risk awareness for llm agents,

T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, F. Li, Z. Zhang et al., “R-Judge: Benchmarking safety risk awareness for LLM agents,” arXiv preprint arXiv:2401.10019, 2024

  31. [31]

    The application of large language models in primary healthcare services and the challenges,

W. Yan, J. Hu, H. Zeng, M. Liu, and W. Liang, “The application of large language models in primary healthcare services and the challenges,” Chinese General Practice, vol. 28, no. 01, p. 1, 2025

  32. [32]

    Aipatient: Simulating patients with ehrs and llm powered agentic workflow,

H. Yu, J. Zhou, L. Li, S. Chen, J. Gallifant, A. Shi, X. Li, W. Hua, M. Jin, G. Chen et al., “AIPatient: Simulating patients with EHRs and LLM powered agentic workflow,” arXiv preprint arXiv:2409.18924, 2024

  33. [33]

AgentClinic: A Multimodal Agent Benchmark to Evaluate AI in Simulated Clinical Environments

    S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor, “AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments,” arXiv preprint arXiv:2405.07960, 2024

  34. [34]

    A survey of llm-based agents in medicine: How far are we from baymax?

W. Wang, Z. Ma, Z. Wang, C. Wu, W. Chen, X. Li, and Y. Yuan, “A survey of LLM-based agents in medicine: How far are we from Baymax?” arXiv preprint arXiv:2502.11211, 2025

  35. [35]

    From prompt injections to protocol exploits: Threats in llm-powered ai agents workflows,

    M. A. Ferrag, N. Tihanyi, D. Hamouda, L. Maglaras, and M. Debbah, “From prompt injections to protocol exploits: Threats in llm-powered ai agents workflows,”arXiv preprint arXiv:2506.23260, 2025

  36. [36]

Executable code actions elicit better LLM agents

    X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” in Forty-first International Conference on Machine Learning, 2024

  37. [37]

    Reflexion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023

  38. [38]

    React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  39. [39]

Language agent tree search unifies reasoning acting and planning in language models

    A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, “Language agent tree search unifies reasoning acting and planning in language models,” arXiv preprint arXiv:2310.04406, 2023

  40. [40]

    Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments,

H. Su et al., “Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments,” arXiv preprint arXiv:2501.10893, 2025

  41. [41]

    Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation,

M. Hu, P. Zhao, C. Xu, Q. Sun, J. Lou, Q. Lin, P. Luo, and S. Rajmohan, “AgentGen: Enhancing planning abilities for large language model based agent via environment and task generation,” arXiv preprint arXiv:2408.00764, 2024

  42. [42]

    Agenttuning: Enabling generalized agent abilities for llms,

A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang, “AgentTuning: Enabling generalized agent abilities for LLMs,” arXiv preprint arXiv:2310.12823, 2023

  43. [43]

    Reinforced Self-Training (ReST) for Language Modeling

C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu et al., “Reinforced self-training (ReST) for language modeling,” arXiv preprint arXiv:2308.08998, 2023

  44. [44]

    Rest meets react: Self-improvement for multi-step reasoning llm agent,

R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu, Z. Fisher, R. Guo, S. Prakash, P. Srinivasan et al., “ReST meets ReAct: Self-improvement for multi-step reasoning LLM agent,” arXiv preprint arXiv:2312.10003, 2023

  45. [45]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024

  46. [46]

    Synthetic data generation & multi-step rl for reasoning & tool use

    A. Goldie, A. Mirhoseini, H. Zhou, I. Cai, and C. D. Manning, “Synthetic data generation & multi-step rl for reasoning & tool use,” arXiv preprint arXiv:2504.04736, 2025

  47. [47]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta programming for multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, vol. 3, no. 4, p. 6, 2023

  48. [48]

    ChatDev: Communicative Agents for Software Development

    C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun, “Communicative agents for software development,” arXiv preprint arXiv:2307.07924, vol. 6, no. 3, 2023

  49. [49]

    Roco: Dialectic multi-robot collaboration with large language models

    Z. Mandi, S. Jain, and S. Song, “Roco: Dialectic multi-robot collaboration with large language models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 286–299

  50. [50]

    Building cooperative embodied agents modularly with large language models

    H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu, and C. Gan, “Building cooperative embodied agents modularly with large language models,” arXiv preprint arXiv:2307.02485, 2023

  51. [51]

    Generative agents: Interactive simulacra of human behavior

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023, pp. 1–22

  52. [52]

    Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research

    B. Xiao, Z. Yin, and Z. Shan, “Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research,” arXiv preprint arXiv:2311.06957, 2023

  53. [53]

    Avalon’s game of thoughts: Battle against deception through recursive contemplation

    S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao, C. Wang, S. Song, and G. Huang, “Avalon’s game of thoughts: Battle against deception through recursive contemplation,” arXiv preprint arXiv:2310.01320, 2023

  54. [54]

    Agents in software engineering: Survey, landscape, and vision

    Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma, Q. Wang, and Z. Zheng, “Agents in software engineering: Survey, landscape, and vision,” arXiv preprint arXiv:2409.09030, 2024

  55. [55]

    From llms to llm-based agents for software engineering: A survey of current, challenges and future

    H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From llms to llm-based agents for software engineering: A survey of current, challenges and future,” arXiv preprint arXiv:2408.02479, 2024

  56. [56]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, “Agentic retrieval-augmented generation: A survey on agentic rag,” arXiv preprint arXiv:2501.09136, 2025

  57. [57]

    Survey on evaluation of llm-based agents

    A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, and M. Shmueli-Scheuer, “Survey on evaluation of llm-based agents,” arXiv preprint arXiv:2503.16416, 2025

  58. [58]

    Survey on Evaluation of LLM-based Agents

    [Online]. Available: https://arxiv.org/abs/2503.16416

  59. [59]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che, “Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,” arXiv preprint arXiv:2503.09567, 2025

  60. [60]

    Beyond self-talk: A communication-centric survey of llm-based multi-agent systems

    B. Yan, X. Zhang, L. Zhang, L. Zhang, Z. Zhou, D. Miao, and C. Li, “Beyond self-talk: A communication-centric survey of llm-based multi-agent systems,” arXiv preprint arXiv:2502.14321, 2025

  61. [61]

    A survey on large language model-based social agents in game-theoretic scenarios

    X. Feng, L. Dou, E. Li, Q. Wang, H. Wang, Y. Guo, C. Ma, and L. Kong, “A survey on large language model-based social agents in game-theoretic scenarios,” arXiv preprint arXiv:2412.03920, 2024

  62. [62]

    Large language model-brained gui agents: A survey

    C. Zhang, S. He, J. Qian, B. Li, L. Li, S. Qin, Y. Kang, M. Ma, G. Liu, Q. Lin et al., “Large language model-brained gui agents: A survey,” arXiv preprint arXiv:2411.18279, 2024

  63. [63]

    Personal llm agents: Insights and survey about the capability, efficiency and security

    Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun et al., “Personal llm agents: Insights and survey about the capability, efficiency and security,” arXiv preprint arXiv:2401.05459, 2024

  64. [64]

    A review of large language models and autonomous agents in chemistry

    M. C. Ramos, C. J. Collison, and A. D. White, “A review of large language models and autonomous agents in chemistry,” Chemical Science, 2025

  65. [65]

    Enigmaeval: A benchmark of long multimodal reasoning challenges

    C. J. Wang, D. Lee, C. Menghini, J. Mols, J. Doughty, A. Khoja, J. Lynch, S. Hendryx, S. Yue, and D. Hendrycks, “Enigmaeval: A benchmark of long multimodal reasoning challenges,” arXiv preprint arXiv:2502.08859, 2025

  66. [66]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020

  67. [67]

    Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario

    L. Zhong, Z. Du, X. Zhang, H. Hu, and J. Tang, “Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario,” arXiv preprint arXiv:2501.10132, 2025

  68. [68]

    Humanity's Last Exam

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, S. Shi, M. Choi, A. Agrawal, A. Chopra et al., “Humanity’s last exam,” arXiv preprint arXiv:2501.14249, 2025

  69. [69]

    Facts & grounding: A new benchmark for evaluating the factuality of large language models

    DeepMind, “Facts & grounding: A new benchmark for evaluating the factuality of large language models,” 2023, accessed: 2025-02-03. [Online]. Available: https://deepmind.google/discover/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/

  70. [70]

    Processbench: Identifying process errors in mathematical reasoning

    C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, and J. Lin, “Processbench: Identifying process errors in mathematical reasoning,” arXiv preprint arXiv:2412.06559, 2024

  71. [71]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

    L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao et al., “Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,” arXiv preprint arXiv:2412.07626, 2024

  72. [72]

    Agent-as-a-judge: Evaluate agents with agents

    M. Zhuge, C. Zhao, D. Ashley, W. Wang, D. Khizbullin, Y. Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y. Tian et al., “Agent-as-a-judge: Evaluate agents with agents,” arXiv preprint arXiv:2410.10934, 2024

  73. [73]

    Judgebench: A benchmark for evaluating llm-based judges

    S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang, R. A. Popa, and I. Stoica, “Judgebench: A benchmark for evaluating llm-based judges,” arXiv preprint arXiv:2410.12784, 2024

  74. [74]

    Introducing simpleqa

    OpenAI, “Introducing simpleqa,” 2024, accessed: 2025-02-03. [Online]. Available: https://openai.com/index/introducing-simpleqa/

  75. [75]

    Fine tasks

    HuggingFaceFW, “Fine tasks,” 2024, accessed: 2025-02-03. [Online]. Available: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks

  76. [76]

    Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation

    S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui, “Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation,” arXiv preprint arXiv:2409.12941, 2024

  77. [77]

    Dabstep

    Hugging Face, “Dabstep,” 2025, accessed: 2025-02-03. [Online]. Available: https://huggingface.co/blog/dabstep

  78. [78]

    Bfcl v2 live

    H. Mao, C. C.-J. Ji, F. Yan, T. Zhang, and S. G. Patil, “Bfcl v2 live,” https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html, 2024, accessed: February 16, 2025

  79. [79]

    Swe-lancer: Can frontier llms earn $1 million from real world freelance software engineering?

    S. Miserendino, M. Wang, T. Patwardhan, and J. Heidecke, “Swe-lancer: Can frontier llms earn $1 million from real world freelance software engineering?” 2025. [Online]. Available: https://arxiv.org/abs/2502.12115

  80. [80]

    Crag–comprehensive rag benchmark

    X. Yang, K. Sun, H. Xin, Y. Sun, N. Bhalla, X. Chen, S. Choudhary, R. D. Gui, Z. W. Jiang, Z. Jiang et al., “Crag–comprehensive rag benchmark,” arXiv preprint arXiv:2406.04744, 2024

Showing first 80 references.