Survey on Evaluation of LLM-based Agents
Pith reviewed 2026-05-22 22:47 UTC · model grok-4.3
The pith
This survey organizes evaluation methods for LLM-based agents into five perspectives and identifies trends plus gaps in safety, cost, and robustness testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims to deliver the first broad review of LLM agent evaluation by breaking the literature into five perspectives—core capabilities required for agent workflows, application-specific benchmarks, generalist agent testing, analysis of benchmark dimensions, and developer tools—while documenting a trend toward realistic and updated benchmarks and calling out shortfalls in cost-efficiency, safety, and robustness assessment plus the need for finer-grained scalable methods.
What carries the argument
The five perspectives used to categorize and analyze all agent evaluation methods and benchmarks.
If this is right
- Future work must create tests that measure how much compute or money an agent uses.
- Safety properties need explicit benchmarks rather than being assumed.
- Robustness checks against changing conditions should become standard.
- Evaluation methods need to scale while staying detailed enough to guide improvements.
- Benchmarks will require ongoing updates to stay relevant and difficult.
Where Pith is reading between the lines
- Clearer evaluation gaps may steer agent developers toward adding separate modules for safety monitoring.
- Connecting the identified shortfalls to existing work on AI alignment could speed up progress on robustness.
- A practical next step would be to build one benchmark that jointly scores task success, cost, and safety on the same agent runs.
- Wider adoption of the survey's structure could make it easier to compare results across different research groups.
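The joint benchmark suggested in the list above could be sketched roughly as follows. This is a minimal illustration, not something the survey proposes. The `AgentRun` record, its field names, and the cost budget are all assumptions made for the example. So is the rule that a run counts jointly only if it succeeds, has no safety violations, and stays within budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent run on one task (hypothetical record, not from the survey)."""
    task_id: str
    success: bool           # did the agent complete the task?
    cost_usd: float         # total spend for the run (API calls, compute)
    safety_violations: int  # number of flagged unsafe actions


def joint_report(runs: list[AgentRun], budget_usd: float) -> dict[str, float]:
    """Score success, cost, and safety on the same set of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    succeeded = sum(r.success for r in runs)
    safe = sum(r.safety_violations == 0 for r in runs)
    # Assumed rule: a run counts jointly only if it succeeds, is safe,
    # and stays within the cost budget.
    joint = sum(
        r.success and r.safety_violations == 0 and r.cost_usd <= budget_usd
        for r in runs
    )
    return {
        "success_rate": succeeded / n,
        "safe_rate": safe / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
        "joint_rate": joint / n,
    }


# Made-up runs for illustration only.
runs = [
    AgentRun("t1", True, 0.50, 0),
    AgentRun("t2", True, 2.00, 0),   # succeeds but exceeds the budget
    AgentRun("t3", True, 0.25, 1),   # succeeds but has a safety violation
    AgentRun("t4", False, 0.25, 0),  # cheap and safe but fails the task
]
report = joint_report(runs, budget_usd=1.0)
```

Reporting the joint rate next to the individual rates shows how far plain task success can overstate usable performance. In this toy data three of four runs succeed, but only one is also safe and within budget.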
Load-bearing premise
The chosen papers and their placement into the five perspectives give a complete and unbiased picture of current research on LLM agent evaluation.
What would settle it
A large set of published evaluation methods or benchmarks for LLM agents that the survey does not include or that cannot be placed in any of the five perspectives.
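One way to make this test concrete is a simple coverage check, sketched below under stated assumptions. The perspective labels paraphrase the five perspectives named above. The function name, the inputs (a candidate list, the set of works the survey covers, and an independent annotator's placement of each candidate), and the example identifiers are hypothetical.

```python
PERSPECTIVES = frozenset({
    "core capabilities",
    "application-specific benchmarks",
    "generalist agents",
    "benchmark dimensions",
    "developer tools",
})


def coverage_gaps(
    candidates: set[str],
    included: set[str],
    placement: dict[str, set[str]],
) -> dict[str, object]:
    """Find candidate works the survey misses or that fit no perspective.

    candidates: published evaluation methods or benchmarks to check
    included:   works the survey actually covers
    placement:  for each candidate, the perspectives it fits (may be empty)
    """
    if not candidates:
        raise ValueError("need at least one candidate work")
    for work in candidates:
        unknown = placement.get(work, set()) - PERSPECTIVES
        if unknown:
            raise ValueError(f"unknown perspective(s) for {work}: {sorted(unknown)}")
    missing = sorted(candidates - included)
    unplaceable = sorted(w for w in candidates if not placement.get(w))
    gap = set(missing) | set(unplaceable)
    return {
        "missing": missing,
        "unplaceable": unplaceable,
        "gap_share": len(gap) / len(candidates),
    }


# Placeholder identifiers, not real benchmarks.
result = coverage_gaps(
    candidates={"bench_a", "bench_b", "bench_c", "bench_d"},
    included={"bench_a", "bench_b"},
    placement={
        "bench_a": {"core capabilities"},
        "bench_b": {"generalist agents"},
        "bench_c": {"developer tools"},
        "bench_d": set(),
    },
)
```

A large `gap_share` on a well-sampled candidate list would count against the completeness premise. A small one would support it only as far as the candidate list itself is representative.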
Original abstract
LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation across five perspectives: (1) Core LLM capabilities needed for agentic workflows, like planning and tool use; (2) Application-specific benchmarks such as web and SWE agents; (3) Evaluation of generalist agents; (4) Analysis of agent benchmarks' core dimensions; and (5) Evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a literature survey claiming to be the first comprehensive review of evaluation methods for LLM-based agents. It structures the analysis around five perspectives—(1) core LLM capabilities such as planning and tool use, (2) application-specific benchmarks (e.g., web and SWE agents), (3) generalist agents, (4) core dimensions of agent benchmarks, and (5) evaluation frameworks and tools—while identifying trends toward realistic, continuously updated benchmarks and gaps in cost-efficiency, safety, robustness, and fine-grained scalable methods.
Significance. If the literature selection and five-perspective categorization prove complete and unbiased, the survey would consolidate a rapidly evolving subfield and usefully highlight actionable gaps for future work on agent evaluation.
major comments (2)
- [Abstract] Abstract: the central claim that this is the 'first comprehensive survey' is load-bearing for the paper's contribution, yet the abstract (and by extension the manuscript framing) provides no details on search methodology, databases queried, inclusion/exclusion criteria, date cutoffs, or explicit comparison against prior surveys on agent or LLM evaluation; without this, the identified trends and gaps cannot be verified as exhaustive rather than artifacts of selection.
- [Five perspectives] Five-perspective structure: the mapping of the literature onto the chosen five perspectives requires explicit justification and a discussion of boundary cases; otherwise it is unclear whether important evaluation dimensions (e.g., multi-agent interaction or long-horizon safety) fall outside the frame and thereby affect the completeness of the gap analysis.
minor comments (1)
- Add a dedicated methods subsection (or appendix) that reports the PRISMA-style flow or equivalent, keyword strings, and total papers screened/included so readers can assess coverage.
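The flow counts requested here could be tallied with something like the sketch below. It assumes a simplified four-stage PRISMA-style flow. The class name, field names, and every number are placeholders for illustration, not figures from the survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningFlow:
    """Counts for a PRISMA-style screening flow (all values are placeholders)."""
    identified: int             # records returned by all keyword searches
    duplicates: int             # removed before screening
    excluded_on_abstract: int   # dropped at title/abstract screening
    excluded_on_full_text: int  # dropped after reading the full paper

    def __post_init__(self) -> None:
        counts = (self.identified, self.duplicates,
                  self.excluded_on_abstract, self.excluded_on_full_text)
        if min(counts) < 0:
            raise ValueError("counts must be non-negative")
        if self.included < 0:
            raise ValueError("exclusions exceed the number of records identified")

    @property
    def screened(self) -> int:
        return self.identified - self.duplicates

    @property
    def assessed_full_text(self) -> int:
        return self.screened - self.excluded_on_abstract

    @property
    def included(self) -> int:
        return self.assessed_full_text - self.excluded_on_full_text

    def summary(self) -> str:
        return "\n".join([
            f"identified:         {self.identified}",
            f"screened:           {self.screened}",
            f"full text assessed: {self.assessed_full_text}",
            f"included:           {self.included}",
        ])


# Placeholder numbers only.
flow = ScreeningFlow(identified=500, duplicates=80,
                     excluded_on_abstract=300, excluded_on_full_text=20)
print(flow.summary())
```

Deriving each stage from the previous one keeps the reported totals internally consistent, which is the property a reader needs in order to assess coverage.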
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional methodological details and justifications.
Point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that this is the 'first comprehensive survey' is load-bearing for the paper's contribution, yet the abstract (and by extension the manuscript framing) provides no details on search methodology, databases queried, inclusion/exclusion criteria, date cutoffs, or explicit comparison against prior surveys on agent or LLM evaluation; without this, the identified trends and gaps cannot be verified as exhaustive rather than artifacts of selection.
Authors: We agree that explicit details on the literature search process are required to substantiate the claim of a first comprehensive survey. In the revised manuscript we will add a dedicated 'Survey Methodology' subsection (in the Introduction) that specifies the databases queried (arXiv, Google Scholar, ACL Anthology), search queries and keywords, date range, inclusion/exclusion criteria, and a direct comparison with prior surveys on LLM or agent evaluation. This addition will allow verification that the identified trends and gaps are not selection artifacts.
Revision: yes
-
Referee: [Five perspectives] Five-perspective structure: the mapping of the literature onto the chosen five perspectives requires explicit justification and a discussion of boundary cases; otherwise it is unclear whether important evaluation dimensions (e.g., multi-agent interaction or long-horizon safety) fall outside the frame and thereby affect the completeness of the gap analysis.
Authors: We accept that the five-perspective categorization needs explicit justification and boundary-case discussion. The revision will include an expanded paragraph (in Section 1) explaining the derivation of the five perspectives from an initial literature scan, their coverage of core capabilities through frameworks, and how boundary topics such as multi-agent interaction and long-horizon safety are either subsumed under 'generalist agents' or 'core dimensions' or flagged as open gaps requiring future work. This will clarify the frame's completeness.
Revision: yes
Circularity Check
No circularity: literature survey with no derivations or self-referential reductions
Full rationale
This is a survey paper with no equations, fitted parameters, predictions, or mathematical derivations. The central claim of providing the first comprehensive survey across five perspectives rests on literature selection and categorization, but a circularity finding requires explicit quotes showing reduction by construction (e.g., self-definition or a fitted input renamed as a prediction). No such steps exist. The paper is self-contained as a review and does not invoke uniqueness theorems, ansatzes, or self-citations in a load-bearing circular manner. A score of 0 is the appropriate default when a survey yields no circularity findings.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The five perspectives comprehensively cover the field of LLM agent evaluation.
Forward citations
Cited by 32 Pith papers
-
CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
-
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...
-
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
-
Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis
Agentic LLMs autonomously execute complex neuro-radiological workflows like glioma segmentation and multi-timepoint response assessment by directing off-the-shelf tools, without any model training.
-
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
-
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
-
Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Frontier LLMs display emerging investigatory agency in autonomous database analysis but struggle with long-horizon exploration on the new DDR-Bench.
-
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
MCP-Atlas introduces a benchmark of 36 real MCP servers, 220 tools, and 1,000 natural-language tasks to measure LLM tool-use competency in multi-server workflows.
-
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
-
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
AgentAtlas defines a six-state control taxonomy and nine-category failure taxonomy, then shows that removing explicit label menus from prompts drops trajectory accuracy 14-40 points to a 0.54-0.62 floor across eight models.
-
The Scaling Laws of Skills in LLM Agent Systems
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations...
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...
-
Agentic AI-Enabled Framework for Thermal Comfort and Building Energy Assessment in Tropical Urban Neighborhoods
The study introduces an agentic AI framework integrating LLMs with lightweight physics models to evaluate thermal comfort and building energy in tropical urban neighborhoods.
-
Diagnosing CFG Interpretation in LLMs
LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.
-
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...
-
Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.
-
Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments
The paper introduces an agentic AI platform to train and support recovered soldiers as peer facilitators providing mental health triage and interventions in austere battlefield environments.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows
CMBAgent achieves high accuracy on well-specified astrophysical tasks with context but generates silent, plausible-yet-incorrect outputs on reasoning-challenging problems, with no self-diagnosis of inconsistencies.
-
Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.
-
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains
Flowr is an agentic AI framework that decomposes retail supply chain workflows into coordinated LLM-based agents with human-in-the-loop oversight to automate operations in large supermarket chains.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...