pith. machine review for the scientific record.

arxiv: 2504.19678 · v2 · submitted 2025-04-28 · 💻 cs.AI · cs.LG

Recognition: 1 theorem link

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:53 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM benchmarks · AI agent frameworks · evaluation taxonomy · multi-agent collaboration · reasoning tasks · code generation benchmarks · multimodal evaluation · autonomous agents

The pith

A review organizes roughly 60 benchmarks for large language models and autonomous agents into one taxonomy covering reasoning, code, and real-world tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper consolidates scattered benchmarks and frameworks for large language models and AI agents that have appeared since 2019. It lines up side-by-side comparisons and groups the tests into categories such as mathematical problem-solving, code generation, factual retrieval, multimodal tasks, and interactive assessments. The work also surveys agent frameworks from 2023 onward, lists applications in science and industry, and examines protocols that let agents communicate with one another. A reader would care because the growing number of models and tools now requires clearer maps to decide which evaluations actually measure useful progress. The review ends by pointing to open questions around advanced reasoning, failure modes, and security in multi-agent systems.

Core claim

The landscape of evaluation benchmarks and AI-agent frameworks remains fragmented, and a unified taxonomy of approximately sixty benchmarks developed between 2019 and 2025 supplies the missing structure. These benchmarks are grouped under general knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding, domain-specific tests, multimodal and embodied tasks, task orchestration, and interactive assessments. The paper further reviews modular agent frameworks, real-world deployments ranging from materials science to finance, and three agent-to-agent collaboration protocols.

What carries the argument

The proposed taxonomy of approximately 60 benchmarks that groups evaluations into eight domain categories and enables side-by-side comparison across years.
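The eight-category grouping lends itself to a simple lookup structure. A minimal sketch in Python, where the category names follow the paper but the benchmark names under each are illustrative choices made here, not the paper's actual assignments:

```python
# Sketch of the eight-category benchmark taxonomy. Category names are taken
# from the review; the benchmarks listed under each are illustrative examples,
# not the paper's real ~60-item mapping.
TAXONOMY = {
    "general and academic knowledge reasoning": ["MMLU", "BIG-Bench Hard"],
    "mathematical problem-solving": ["GSM8K", "MATH"],
    "code generation and software engineering": ["HumanEval", "SWE-bench"],
    "factual grounding and retrieval": ["TruthfulQA", "Natural Questions"],
    "domain-specific evaluations": ["MedQA", "LegalBench"],
    "multimodal and embodied tasks": ["MMMU", "ALFWorld"],
    "task orchestration": ["AgentBench"],
    "interactive assessments": ["WebArena"],
}

def categories_for(benchmark: str) -> list[str]:
    """Return every category a benchmark appears under, enabling the
    side-by-side comparison the taxonomy is meant to support."""
    return [cat for cat, names in TAXONOMY.items() if benchmark in names]
```

A structure like this makes the taxonomy queryable in both directions: by category (which tests cover code generation?) and by benchmark (where does GSM8K sit?).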

If this is right

  • Researchers gain a single reference to compare model performance across mathematical, code, and embodied domains.
  • Agent frameworks can be evaluated consistently against the same set of orchestration and interaction tests.
  • Real-world applications in healthcare, finance, and materials science receive clearer benchmark baselines.
  • Collaboration protocols such as ACP, MCP, and A2A become easier to test against shared failure-mode criteria.
  • Future work on reinforcement-learning tool integration and automated scientific discovery can build on the identified gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could seed a living public repository that adds new benchmarks automatically as they appear.
  • Security testing for agent protocols may need its own dedicated category to keep pace with deployment risks.
  • Connections between the embodied-task benchmarks and physical robotics experiments remain untested in the current grouping.
  • Dynamic tool use via reinforcement learning could be evaluated by extending the existing task-orchestration category.

Load-bearing premise

The selected benchmarks and frameworks are assumed to cover the main landscape without large omissions or selection bias.

What would settle it

A complete inventory of all benchmarks published between 2019 and 2025 that reveals many were omitted from the taxonomy, or that shows the eight-category grouping misses major overlaps or gaps.

read the original abstract

Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. Driven by the growing need for standardized evaluation and integration, we systematically consolidate these fragmented efforts into a unified framework. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side-by-side comparison of benchmarks developed between 2019 and 2025 that evaluate these models and agents across multiple domains. In addition, we propose a taxonomy of approximately 60 benchmarks that cover general and academic knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments. Furthermore, we review AI-agent frameworks introduced between 2023 and 2025 that integrate large language models with modular toolkits to enable autonomous decision-making and multi-step reasoning. Moreover, we present real-world applications of autonomous AI agents in materials science, biomedical research, academic ideation, software engineering, synthetic data generation, chemical reasoning, mathematical problem-solving, geographic information systems, multimedia, healthcare, and finance. We then survey key agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A). Finally, we discuss recommendations for future research, focusing on advanced reasoning strategies, failure modes in multi-agent LLM systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to systematically consolidate fragmented efforts in LLM reasoning and autonomous AI agents into a unified framework. It provides a side-by-side comparison and proposes a taxonomy of approximately 60 benchmarks (2019–2025) covering general/academic knowledge, mathematical problem-solving, code generation, factual grounding, domain-specific tasks, multimodal/embodied work, task orchestration, and interactive assessments. It further reviews 2023–2025 agent frameworks, real-world applications across materials science, biomedicine, software engineering and other domains, agent-to-agent protocols (ACP, MCP, A2A), and offers future research recommendations on reasoning strategies, failure modes, and security.

Significance. A reproducible, exhaustive taxonomy and comparison of recent benchmarks would be a useful consolidation for a fast-moving field, reducing duplication and aiding standardized evaluation. The review of frameworks, applications, and protocols adds practical value if the coverage is representative rather than selective.

major comments (2)
  1. [Abstract / Taxonomy Proposal] The central claim of a 'unified taxonomy' of ~60 benchmarks (abstract and taxonomy section) rests on an unspecified selection process. No search strategy, databases, keywords, date ranges, or inclusion/exclusion criteria are provided, nor is the number of screened versus included items reported. This directly undermines verifiability of exhaustiveness across the listed domains.
  2. [Benchmark Comparison] The side-by-side benchmark comparison (abstract) cannot be assessed for balance or bias without the methodology details above; domain coverage (e.g., multimodal vs. code generation) may reflect author preference rather than literature distribution, weakening the unification argument.
minor comments (1)
  1. [Review of Frameworks] Clarify the exact time windows used for frameworks (2023–2025) versus benchmarks (2019–2025) and ensure consistent citation of all included works.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological transparency. We agree that explicit details on the literature search process are needed to support claims of exhaustiveness and will revise the manuscript to address this.

read point-by-point responses
  1. Referee: [Abstract / Taxonomy Proposal] The central claim of a 'unified taxonomy' of ~60 benchmarks (abstract and taxonomy section) rests on an unspecified selection process. No search strategy, databases, keywords, date ranges, or inclusion/exclusion criteria are provided, nor is the number of screened versus included items reported. This directly undermines verifiability of exhaustiveness across the listed domains.

    Authors: We acknowledge that the current manuscript does not include an explicit description of the search methodology. The taxonomy was compiled via a broad review of arXiv, Google Scholar, and major NLP venues for papers from 2019–2025 using keywords such as 'LLM benchmark', 'agent evaluation', 'multimodal reasoning', and 'autonomous agent framework'. In the revised version we will add a 'Literature Search Methodology' subsection (with a PRISMA-style flow) reporting databases, exact keywords, date filters, inclusion criteria (empirical evaluations of LLM-based agents or benchmarks), and screened/included counts to enable verification of coverage. revision: yes

  2. Referee: [Benchmark Comparison] The side-by-side benchmark comparison (abstract) cannot be assessed for balance or bias without the methodology details above; domain coverage (e.g., multimodal vs. code generation) may reflect author preference rather than literature distribution, weakening the unification argument.

    Authors: We agree that balance cannot be fully evaluated without the search details. Once the methodology section is added, we will also include a brief discussion of domain prevalence in the retrieved literature and note any areas where coverage is sparser (e.g., certain embodied tasks). The comparison table will be framed as a representative sample drawn from the identified works rather than an exhaustive census, thereby reducing the risk of perceived selection bias. revision: yes
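The PRISMA-style reporting promised in the rebuttal reduces to tallying records at each screening stage so that screened versus included counts can be verified. A minimal sketch, assuming hypothetical record fields and the inclusion criteria the rebuttal names (2019–2025 date window, empirical evaluations of LLM-based agents or benchmarks):

```python
from dataclasses import dataclass

# Hypothetical record schema for a PRISMA-style screening tally; the field
# names and criteria here are illustrative, not from the paper.
@dataclass
class Record:
    title: str
    year: int
    is_empirical: bool  # inclusion criterion: empirical evaluation of LLM agents/benchmarks

def screen(records, year_min=2019, year_max=2025):
    """Apply the date-window and empirical-study filters, returning
    (number screened, number included, included records)."""
    screened = len(records)
    included = [r for r in records
                if year_min <= r.year <= year_max and r.is_empirical]
    return screened, len(included), included
```

Reporting both counts (and the reasons records were excluded) is what lets a reader check the claimed exhaustiveness of the ~60-benchmark taxonomy.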

Circularity Check

0 steps flagged

No circularity in survey taxonomy or framework proposal

full rationale

This is a literature review paper that aggregates and organizes existing benchmarks, frameworks, and protocols from 2019-2025 without any mathematical derivations, fitted parameters, or equations. The proposed taxonomy of ~60 benchmarks is presented as an organizational synthesis of cited prior work rather than a quantity derived from self-referential inputs or self-citations. No load-bearing step reduces to a tautology, self-definition, or author-specific uniqueness theorem; the central claims rest on external citations and descriptive coverage rather than internal construction. The absence of a detailed search protocol is a methodological limitation but does not create circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper is a synthesis of prior published work on LLM agents and benchmarks.

pith-pipeline@v0.9.0 · 5584 in / 1053 out tokens · 40369 ms · 2026-05-15T02:53:18.046467+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tools as Continuous Flow for Evolving Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-...

  2. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  3. Exploiting LLM Agent Supply Chains via Payload-less Skills

    cs.CR 2026-05 conditional novelty 6.0

    Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...

  4. Position: Assistive Agents Need Accessibility Alignment

    cs.AI 2026-05 conditional novelty 6.0

    Assistive agents for BVI users need accessibility alignment as a core design goal, with a proposed lifecycle pipeline, because sighted assumptions cause unfixable failures in verification, risk, and interaction.

  5. STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    STAR combines expert nominal routes with trace-learned recovery transitions in a failure-typed routing matrix, improving multi-agent spatiotemporal reasoning over baselines especially on error-deviating queries.

  6. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  7. LATTICE: Evaluating Decision Support Utility of Crypto Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

  8. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  9. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  10. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  11. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.

  12. MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search

    cs.IR 2026-04 unverdicted novelty 6.0

    MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.

  13. Agentic Frameworks for Reasoning Tasks: An Empirical Study

    cs.AI 2026-04 unverdicted novelty 6.0

    An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

  14. AgentComm: Semantic Communication for Embodied Agents

    eess.SP 2026-04 unverdicted novelty 6.0

    AgentComm achieves nearly 50% bandwidth reduction in embodied agent communication via LLM semantic processing, importance-aware transmission, and a task knowledge base, with negligible impact on task completion.

  15. STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    STAR is a failure-aware Markovian router that learns recovery transitions from both successful and unsuccessful execution traces to improve multi-agent performance on spatiotemporal benchmarks.

  16. A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    Researchers developed a fast XGBoost-based detector using 42 runtime features to spot adversarial interaction patterns in LLM agents, running over 9 times faster than LLM detectors on synthetic multi-turn data.

  17. AgentDID: Trustless Identity Authentication for AI Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    AgentDID is a W3C-compliant decentralized identity system for AI agents enabling self-managed authentication and state verification via challenge-response.

  18. Intention-Aware Semantic Agent Communications for AI Glasses

    eess.SP 2026-04 unverdicted novelty 5.0

    An intention-aware semantic agent system for AI glasses reduces bandwidth by over 50% in simulations while preserving task performance through adaptive preprocessing guided by inferred user intentions.

  19. Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

    cs.AI 2026-04 unverdicted novelty 4.0

    A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.

  20. A Survey of Context Engineering for Large Language Models

    cs.CL 2025-07 accept novelty 4.0

    The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

  21. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    cs.AI 2025-01 unverdicted novelty 4.0

    Agentic RAG embeds agents with reflection, planning, tool use, and collaboration into retrieval pipelines to overcome static RAG limitations, and the survey offers a taxonomy by agent count, control, autonomy, and kno...

  22. A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective

    q-fin.PR 2026-04 unverdicted novelty 3.0

    This review synthesizes LLM uses in stock forecasting and catalogs key practical pitfalls from a hedge-fund viewpoint.

Reference graph

Works this paper leans on

236 extracted references · 236 canonical work pages · cited by 20 Pith papers · 34 internal anchors

  1. [1]

    OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney et al., “OpenAI o1 system card,” arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-Omni technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2503.20215

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

  4. [4]

    The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  5. [5]

    Understanding the planning of LLM agents: A survey

X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang, R. Tang, and E. Chen, “Understanding the planning of LLM agents: A survey,” arXiv preprint arXiv:2402.02716, 2024

  6. [6]

    A Survey on LLM-as-a-Judge

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu et al., “A survey on LLM-as-a-judge,” arXiv preprint arXiv:2411.15594, 2024

  7. [7]

    I know which llm wrote your code last summer: Llm generated code stylometry for authorship attribution,

T. Bisztray, B. Cherif, R. A. Dubniczky, N. Gruschka, B. Borsos, M. A. Ferrag, A. Kovacs, V. Mavroeidis, and N. Tihanyi, “I know which LLM wrote your code last summer: LLM generated code stylometry for authorship attribution,” arXiv preprint arXiv:2506.17323, 2025

  8. [8]

    Vidorag: Visual document retrieval-augmented generation via dynamic iterative reasoning agents,

    Q. Wang, R. Ding, Z. Chen, W. Wu, S. Wang, P. Xie, and F. Zhao, “Vidorag: Visual document retrieval-augmented generation via dynamic iterative reasoning agents,”arXiv preprint arXiv:2502.18017, 2025

  9. [9]

    Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent,

Y. Li, Y. Li, X. Wang, Y. Jiang, Z. Zhang, X. Zheng, H. Wang, H.-T. Zheng, P. Xie, P. S. Yu et al., “Benchmarking multimodal retrieval augmented generation with dynamic VQA dataset and self-adaptive planning agent,” arXiv preprint arXiv:2411.02937, 2024

  10. [10]

    Rag-kg-il: A multi-agent hybrid framework for reducing hallucinations and enhancing llm reasoning through rag and incremental knowledge graph learning integration,

    H. Q. Yu and F. McQuade, “Rag-kg-il: A multi-agent hybrid framework for reducing hallucinations and enhancing llm reasoning through rag and incremental knowledge graph learning integration,”arXiv preprint arXiv:2503.13514, 2025

  11. [11]

BioRAGent: A retrieval-augmented generation system for showcasing generative query expansion and domain-specific search for scientific Q&A

    S. Ateia and U. Kruschwitz, “BioRAGent: A retrieval-augmented generation system for showcasing generative query expansion and domain-specific search for scientific Q&A,” arXiv preprint arXiv:2412.12358, 2024

  12. [12]

    Retrieval-augmented simulacra: Generative agents for up-to-date and knowledge-adaptive simulations,

    H. Shimadzu, T. Utsuro, and D. Kitayama, “Retrieval-augmented simulacra: Generative agents for up-to-date and knowledge-adaptive simulations,”arXiv preprint arXiv:2503.14620, 2025

  13. [13]

    Rag-gym: Optimizing reasoning and search agents with process supervision,

G. Xiong, Q. Jin, X. Wang, Y. Fang, H. Liu, Y. Yang, F. Chen, Z. Song, D. Wang, M. Zhang et al., “RAG-Gym: Optimizing reasoning and search agents with process supervision,” arXiv preprint arXiv:2502.13957, 2025

  14. [14]

    Reasoning beyond limits: Advances and open problems for llms,

    M. A. Ferrag, N. Tihanyi, and M. Debbah, “Reasoning beyond limits: Advances and open problems for llms,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22732

  15. [15]

    GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  16. [16]

    Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  17. [17]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  18. [18]

Agent Laboratory: Using LLM Agents as Research Assistants

    S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, Z. Liu, and E. Barsoum, “Agent laboratory: Using LLM agents as research assistants,” arXiv preprint arXiv:2501.04227, 2025

  19. [19]

    Litsearch: A retrieval benchmark for scientific literature search,

    A. Ajith, M. Xia, A. Chevalier, T. Goyal, D. Chen, and T. Gao, “Litsearch: A retrieval benchmark for scientific literature search,”arXiv preprint arXiv:2407.18940, 2024

  20. [20]

    Researcharena: Benchmarking llms’ ability to collect and organize information as research agents,

    H. Kang and C. Xiong, “Researcharena: Benchmarking llms’ ability to collect and organize information as research agents,”arXiv preprint arXiv:2406.10291, 2024

  21. [21]

    Researchagent: Iterative research idea generation over scientific literature with large language models,

    J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang, “Researchagent: Iterative research idea generation over scientific literature with large language models,”arXiv preprint arXiv:2404.07738, 2024

  22. [22]

    Agentic ai for scientific discovery: A survey of progress, challenges, and future directions,

    M. Gridach, J. Nanavati, K. Z. E. Abidine, L. Mendes, and C. Mack, “Agentic ai for scientific discovery: A survey of progress, challenges, and future directions,”arXiv preprint arXiv:2503.08979, 2025

  23. [23]

    Mdagents: An adaptive collaboration of llms for medical decision-making,

Y. Kim, C. Park, H. Jeong, Y. S. Chan, X. Xu, D. McDuff, H. Lee, M. Ghassemi, C. Breazeal, H. Park et al., “MDAgents: An adaptive collaboration of LLMs for medical decision-making,” Advances in Neural Information Processing Systems, vol. 37, pp. 79410–79452, 2024

  24. [24]

    Agent-flan: Designing data and methods of effective agent tuning for large language models,

    Z. Chen, K. Liu, Q. Wang, W. Zhang, J. Liu, D. Lin, K. Chen, and F. Zhao, “Agent-flan: Designing data and methods of effective agent tuning for large language models,”arXiv preprint arXiv:2403.12881, 2024

  25. [25]

Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents

    J. Li, Y. Lai, W. Li, J. Ren, M. Zhang, X. Kang, S. Wang, P. Li, Y.-Q. Zhang, W. Ma et al., “Agent hospital: A simulacrum of hospital with evolvable medical agents,” arXiv preprint arXiv:2405.02957, 2024

  26. [26]

    Agent ai: Surveying the horizons of multimodal interaction

Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y. Noda, D. Terzopoulos, Y. Choi et al., “Agent AI: Surveying the horizons of multimodal interaction,” arXiv preprint arXiv:2401.03568, 2024

  27. [27]

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

    Y. Zhu, S. Qiao, Y. Ou, S. Deng, S. Lyu, Y. Shen, L. Liang, J. Gu, H. Chen, and N. Zhang, “KnowAgent: Knowledge-augmented planning for LLM-based agents,” arXiv preprint arXiv:2403.03101, 2024

  28. [28]

    Webvoyager: Building an end-to-end web agent with large multimodal models,

H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu, “WebVoyager: Building an end-to-end web agent with large multimodal models,” arXiv preprint arXiv:2401.13919, 2024

  29. [29]

    Polaris: A safety-focused llm constellation architecture for healthcare,

S. Mukherjee, P. Gamble, M. S. Ausin, N. Kant, K. Aggarwal, N. Manjunath, D. Datta, Z. Liu, J. Ding, S. Busacca et al., “Polaris: A safety-focused LLM constellation architecture for healthcare,” arXiv preprint arXiv:2403.13313, 2024

  30. [30]

    R-judge: Benchmarking safety risk awareness for llm agents,

T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, F. Li, Z. Zhang et al., “R-Judge: Benchmarking safety risk awareness for LLM agents,” arXiv preprint arXiv:2401.10019, 2024

  31. [31]

    The application of large language models in primary healthcare services and the challenges,

W. Yan, J. Hu, H. Zeng, M. Liu, and W. Liang, “The application of large language models in primary healthcare services and the challenges,” Chinese General Practice, vol. 28, no. 01, p. 1, 2025

  32. [32]

    Aipatient: Simulating patients with ehrs and llm powered agentic workflow,

H. Yu, J. Zhou, L. Li, S. Chen, J. Gallifant, A. Shi, X. Li, W. Hua, M. Jin, G. Chen et al., “AIPatient: Simulating patients with EHRs and LLM powered agentic workflow,” arXiv preprint arXiv:2409.18924, 2024

  33. [33]

AgentClinic: A Multimodal Agent Benchmark to Evaluate AI in Simulated Clinical Environments

    S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor, “AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments,” arXiv preprint arXiv:2405.07960, 2024

  34. [34]

    A survey of llm-based agents in medicine: How far are we from baymax?

W. Wang, Z. Ma, Z. Wang, C. Wu, W. Chen, X. Li, and Y. Yuan, “A survey of LLM-based agents in medicine: How far are we from Baymax?” arXiv preprint arXiv:2502.11211, 2025

  35. [35]

    From prompt injections to protocol exploits: Threats in llm-powered ai agents workflows,

    M. A. Ferrag, N. Tihanyi, D. Hamouda, L. Maglaras, and M. Debbah, “From prompt injections to protocol exploits: Threats in llm-powered ai agents workflows,”arXiv preprint arXiv:2506.23260, 2025

  36. [36]

Executable code actions elicit better LLM agents

    X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” in Forty-first International Conference on Machine Learning, 2024

  37. [37]

    Reflexion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023

  38. [38]

    React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  39. [39]

Language agent tree search unifies reasoning acting and planning in language models

    A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, “Language agent tree search unifies reasoning acting and planning in language models,” arXiv preprint arXiv:2310.04406, 2023

  40. [40]

    Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments,

H. Su et al., “Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments,” arXiv preprint arXiv:2501.10893, 2025

  41. [41]

    Agentgen: Enhancing planning abilities for large language model based agent via environment and task generation,

M. Hu, P. Zhao, C. Xu, Q. Sun, J. Lou, Q. Lin, P. Luo, and S. Rajmohan, “AgentGen: Enhancing planning abilities for large language model based agent via environment and task generation,” arXiv preprint arXiv:2408.00764, 2024

  42. [42]

    Agenttuning: Enabling generalized agent abilities for llms,

A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang, “AgentTuning: Enabling generalized agent abilities for LLMs,” arXiv preprint arXiv:2310.12823, 2023

  43. [43]

    Reinforced Self-Training (ReST) for Language Modeling

C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu et al., “Reinforced self-training (ReST) for language modeling,” arXiv preprint arXiv:2308.08998, 2023

  44. [44]

    Rest meets react: Self-improvement for multi-step reasoning llm agent,

R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu, Z. Fisher, R. Guo, S. Prakash, P. Srinivasan et al., “ReST meets ReAct: Self-improvement for multi-step reasoning LLM agent,” arXiv preprint arXiv:2312.10003, 2023

  45. [45]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024

  46. [46]

    Synthetic data generation & multi-step rl for reasoning & tool use

    A. Goldie, A. Mirhoseini, H. Zhou, I. Cai, and C. D. Manning, “Synthetic data generation & multi-step rl for reasoning & tool use,” arXiv preprint arXiv:2504.04736, 2025

  47. [47]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta programming for multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, vol. 3, no. 4, p. 6, 2023

  48. [48]

    ChatDev: Communicative Agents for Software Development

    C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun, “Communicative agents for software development,” arXiv preprint arXiv:2307.07924, vol. 6, no. 3, 2023

  49. [49]

    Roco: Dialectic multi-robot collaboration with large language models

    Z. Mandi, S. Jain, and S. Song, “Roco: Dialectic multi-robot collaboration with large language models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 286–299

  50. [50]

    Building cooperative embodied agents modularly with large language models

    H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu, and C. Gan, “Building cooperative embodied agents modularly with large language models,” arXiv preprint arXiv:2307.02485, 2023

  51. [51]

    Generative agents: Interactive simulacra of human behavior

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023, pp. 1–22

  52. [52]

    Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research

    B. Xiao, Z. Yin, and Z. Shan, “Simulating public administration crisis: A novel generative agent-based simulation system to lower technology barriers in social science research,” arXiv preprint arXiv:2311.06957, 2023

  53. [53]

    Avalon’s game of thoughts: Battle against deception through recursive contemplation

    S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao, C. Wang, S. Song, and G. Huang, “Avalon’s game of thoughts: Battle against deception through recursive contemplation,” arXiv preprint arXiv:2310.01320, 2023

  54. [54]

    Agents in software engineering: Survey, landscape, and vision

    Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma, Q. Wang, and Z. Zheng, “Agents in software engineering: Survey, landscape, and vision,” arXiv preprint arXiv:2409.09030, 2024

  55. [55]

    From llms to llm-based agents for software engineering: A survey of current, challenges and future

    H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From llms to llm-based agents for software engineering: A survey of current, challenges and future,” arXiv preprint arXiv:2408.02479, 2024

  56. [56]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    A. Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, “Agentic retrieval-augmented generation: A survey on agentic rag,” arXiv preprint arXiv:2501.09136, 2025

  57. [57]

    Survey on evaluation of llm-based agents

    A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, and M. Shmueli-Scheuer, “Survey on evaluation of llm-based agents,” arXiv preprint arXiv:2503.16416, 2025

  58. [58]

    Survey on Evaluation of LLM-based Agents

    [Online]. Available: https://arxiv.org/abs/2503.16416

  59. [59]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che, “Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,” arXiv preprint arXiv:2503.09567, 2025

  60. [60]

    Beyond self-talk: A communication-centric survey of llm-based multi-agent systems

    B. Yan, X. Zhang, L. Zhang, L. Zhang, Z. Zhou, D. Miao, and C. Li, “Beyond self-talk: A communication-centric survey of llm-based multi-agent systems,” arXiv preprint arXiv:2502.14321, 2025

  61. [61]

    A survey on large language model-based social agents in game-theoretic scenarios

    X. Feng, L. Dou, E. Li, Q. Wang, H. Wang, Y. Guo, C. Ma, and L. Kong, “A survey on large language model-based social agents in game-theoretic scenarios,” arXiv preprint arXiv:2412.03920, 2024

  62. [62]

    Large language model-brained gui agents: A survey

    C. Zhang, S. He, J. Qian, B. Li, L. Li, S. Qin, Y. Kang, M. Ma, G. Liu, Q. Lin et al., “Large language model-brained gui agents: A survey,” arXiv preprint arXiv:2411.18279, 2024

  63. [63]

    Personal llm agents: Insights and survey about the capability, efficiency and security

    Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun et al., “Personal llm agents: Insights and survey about the capability, efficiency and security,” arXiv preprint arXiv:2401.05459, 2024

  64. [64]

    A review of large language models and autonomous agents in chemistry

    M. C. Ramos, C. J. Collison, and A. D. White, “A review of large language models and autonomous agents in chemistry,” Chemical Science, 2025

  65. [65]

    Enigmaeval: A benchmark of long multimodal reasoning challenges

    C. J. Wang, D. Lee, C. Menghini, J. Mols, J. Doughty, A. Khoja, J. Lynch, S. Hendryx, S. Yue, and D. Hendrycks, “Enigmaeval: A benchmark of long multimodal reasoning challenges,” arXiv preprint arXiv:2502.08859, 2025

  66. [66]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020

  67. [67]

    Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario

    L. Zhong, Z. Du, X. Zhang, H. Hu, and J. Tang, “Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario,” arXiv preprint arXiv:2501.10132, 2025

  68. [68]

    Humanity's Last Exam

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, S. Shi, M. Choi, A. Agrawal, A. Chopra et al., “Humanity’s last exam,” arXiv preprint arXiv:2501.14249, 2025

  69. [69]

    Facts & grounding: A new benchmark for evaluating the factuality of large language models

    DeepMind, “Facts & grounding: A new benchmark for evaluating the factuality of large language models,” 2023, accessed: 2025-02-03. [Online]. Available: https://deepmind.google/discover/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/

  70. [70]

    Processbench: Identifying process errors in mathematical reasoning

    C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, and J. Lin, “Processbench: Identifying process errors in mathematical reasoning,” arXiv preprint arXiv:2412.06559, 2024

  71. [71]

    Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations

    L. Ouyang, Y. Qu, H. Zhou, J. Zhu, R. Zhang, Q. Lin, B. Wang, Z. Zhao, M. Jiang, X. Zhao et al., “Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations,” arXiv preprint arXiv:2412.07626, 2024

  72. [72]

    Agent-as-a-judge: Evaluate agents with agents

    M. Zhuge, C. Zhao, D. Ashley, W. Wang, D. Khizbullin, Y. Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y. Tian et al., “Agent-as-a-judge: Evaluate agents with agents,” arXiv preprint arXiv:2410.10934, 2024

  73. [73]

    Judgebench: A benchmark for evaluating llm-based judges

    S. Tan, S. Zhuang, K. Montgomery, W. Y. Tang, A. Cuadron, C. Wang, R. A. Popa, and I. Stoica, “Judgebench: A benchmark for evaluating llm-based judges,” arXiv preprint arXiv:2410.12784, 2024

  74. [74]

    Introducing simpleqa

    OpenAI, “Introducing simpleqa,” 2024, accessed: 2025-02-03. [Online]. Available: https://openai.com/index/introducing-simpleqa/

  75. [75]

    Fine tasks

    HuggingFaceFW, “Fine tasks,” 2024, accessed: 2025-02-03. [Online]. Available: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks

  76. [76]

    Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation

    S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui, “Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation,” arXiv preprint arXiv:2409.12941, 2024

  77. [77]

    Dabstep

    Hugging Face, “Dabstep,” 2025, accessed: 2025-02-03. [Online]. Available: https://huggingface.co/blog/dabstep

  78. [78]

    Bfcl v2 live

    H. Mao, C. C.-J. Ji, F. Yan, T. Zhang, and S. G. Patil, “Bfcl v2 live,” https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html, 2024, accessed: February 16, 2025

  79. [79]

    Swe-lancer: Can frontier llms earn $1 million from real world freelance software engineering?

    S. Miserendino, M. Wang, T. Patwardhan, and J. Heidecke, “Swe-lancer: Can frontier llms earn $1 million from real world freelance software engineering?” 2025. [Online]. Available: https://arxiv.org/abs/2502.12115

  80. [80]

    Crag–comprehensive rag benchmark

    X. Yang, K. Sun, H. Xin, Y. Sun, N. Bhalla, X. Chen, S. Choudhary, R. D. Gui, Z. W. Jiang, Z. Jiang et al., “Crag–comprehensive rag benchmark,” arXiv preprint arXiv:2406.04744, 2024

Showing first 80 references.