Survey on Evaluation of LLM-based Agents

Alan Li; Arman Cohan; Asaf Yehudai; Guy Uziel; Lilach Eden; Michal Shmueli-Scheuer; Roy Bar-Haim; Yilun Zhao

arXiv: 2503.16416 · v2 · submitted 2025-03-20 · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-22 22:47 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG

keywords LLM-based agents · evaluation benchmarks · agentic workflows · planning and tool use · safety and robustness · cost-efficiency · generalist agents · evaluation frameworks

The pith

This survey organizes evaluation methods for LLM-based agents into five perspectives and identifies trends plus gaps in safety, cost, and robustness testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how researchers test LLM-based agents that plan, reason with tools, and act in changing environments. It groups existing work into five perspectives covering basic capabilities such as planning, domain-specific tests like web agents, general-purpose agents, the structure of benchmarks themselves, and practical frameworks for developers. Analysis shows movement toward harder, regularly refreshed evaluations that better match real use. It also flags missing coverage of efficiency, safety, and reliability measures. Readers would care because clear evaluation standards shape whether these agents can be trusted in practical settings.

Core claim

The paper claims to deliver the first broad review of LLM agent evaluation. It breaks the literature into five perspectives: core capabilities required for agent workflows, application-specific benchmarks, generalist agent testing, analysis of benchmark dimensions, and developer tools. It documents a trend toward realistic, regularly updated benchmarks and calls out shortfalls in cost-efficiency, safety, and robustness assessment, plus the need for finer-grained, scalable methods.

What carries the argument

The five perspectives used to categorize and analyze all agent evaluation methods and benchmarks.

If this is right

Future work must create tests that measure how much compute or money an agent uses.
Safety properties need explicit benchmarks rather than being assumed.
Robustness checks against changing conditions should become standard.
Evaluation methods need to scale while staying detailed enough to guide improvements.
Benchmarks will require ongoing updates to stay relevant and difficult.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clearer evaluation gaps may steer agent developers toward adding separate modules for safety monitoring.
Connecting the identified shortfalls to existing work on AI alignment could speed up progress on robustness.
A practical next step would be to build one benchmark that jointly scores task success, cost, and safety on the same agent runs.
Wider adoption of the survey's structure could make it easier to compare results across different research groups.
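The joint-scoring idea in the list above can be sketched in a few lines. This is a minimal illustration, not a method from the paper: the `AgentRun` record, its field names, and the `joint_report` function are hypothetical, and the choice of metrics (including counting a run as a "safe success" only when it both succeeds and triggers no violations) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One recorded agent episode. All field names are hypothetical."""
    task_id: str
    succeeded: bool         # did the agent complete the task?
    cost_usd: float         # money spent on this run
    safety_violations: int  # count of actions flagged as unsafe


def joint_report(runs: list[AgentRun]) -> dict[str, float]:
    """Score success, cost, and safety over the same set of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    solved = sum(1 for r in runs if r.succeeded)
    clean = sum(1 for r in runs if r.safety_violations == 0)
    # A run counts here only if it is both successful and violation-free.
    safe_solved = sum(
        1 for r in runs if r.succeeded and r.safety_violations == 0
    )
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "success_rate": solved / n,
        "violation_free_rate": clean / n,
        "safe_success_rate": safe_solved / n,
        "mean_cost_usd": total_cost / n,
        # Cost per solved task; infinite when nothing was solved.
        "cost_per_success_usd": total_cost / solved if solved else float("inf"),
    }


if __name__ == "__main__":
    # Made-up runs, for illustration only.
    demo = [
        AgentRun("task-1", True, 0.5, 0),
        AgentRun("task-2", True, 1.5, 1),
        AgentRun("task-3", False, 1.0, 0),
    ]
    for name, value in joint_report(demo).items():
        print(f"{name}: {value:.3f}")
```

Reporting the three axes from one set of runs, rather than from separate benchmarks, is what makes trade-offs visible: an agent that raises its success rate by spending more or by taking flagged actions shows up in the same report.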

Load-bearing premise

The chosen papers and their placement into the five perspectives give a complete and unbiased picture of current research on LLM agent evaluation.

What would settle it

A large set of published evaluation methods or benchmarks for LLM agents that the survey does not include or that cannot be placed in any of the five perspectives.
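One way to make this test concrete is a small coverage check over a labeled list of works. The sketch below is an assumption-laden illustration: the perspective labels are shorthand for the five perspectives named above, and the `coverage_gaps` function and the sample work names are hypothetical.

```python
# Shorthand labels for the survey's five perspectives (wording is ours).
PERSPECTIVES = (
    "core capabilities",
    "application-specific benchmarks",
    "generalist agents",
    "benchmark dimensions",
    "developer frameworks",
)


def coverage_gaps(assignments: dict[str, set[str]]) -> dict[str, list[str]]:
    """Find works with no perspective and perspectives with no works.

    `assignments` maps a work's name to the set of perspectives a
    reviewer placed it under (an empty set means "fits nowhere").
    """
    used = set().union(*assignments.values()) if assignments else set()
    unknown = used - set(PERSPECTIVES)
    if unknown:
        raise ValueError(f"unknown perspective labels: {sorted(unknown)}")
    unplaced = sorted(name for name, labels in assignments.items() if not labels)
    empty = [p for p in PERSPECTIVES if p not in used]
    return {"unplaced_works": unplaced, "empty_perspectives": empty}


if __name__ == "__main__":
    # Placeholder names, not real papers.
    sample = {
        "work-a": {"core capabilities"},
        "work-b": set(),
        "work-c": {"generalist agents", "benchmark dimensions"},
    }
    print(coverage_gaps(sample))
```

A long `unplaced_works` list would be the kind of evidence described above; an empty one says only that the labeled sample fits, not that the sample itself is complete.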

Figures

Figures reproduced from arXiv: 2503.16416 by Alan Li, Arman Cohan, Asaf Yehudai, Guy Uziel, Lilach Eden, Michal Shmueli-Scheuer, Roy Bar-Haim, Yilun Zhao.

**Figure 1.** Overview of the paper. Accompanying text, as extracted: "… and memory. We then review benchmarks and evaluation strategies for prominent types of agentic applications: web agents, software engineering agents, scientific agents and conversational agents (§3). Next, we describe benchmarks and leaderboards for evaluating general-purpose agents (§4), which assess the agent’s ability to perform different tasks that require diverse skills. The ne…"

Original abstract

LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methods for these increasingly capable agents. We analyze the field of agent evaluation across five perspectives: (1) Core LLM capabilities needed for agentic workflows, like planning, and tool use; (2) Application-specific benchmarks such as web and SWE agents; (3) Evaluation of generalist agents; (4) Analysis of agent benchmarks' core dimensions; and (5) Evaluation frameworks and tools for agent developers. Our analysis reveals current trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, scalable evaluation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Survey organizes LLM agent evaluation literature into five perspectives but its 'first comprehensive' claim depends on unshown search details.

This paper is a survey that pulls together evaluation methods for LLM-based agents and organizes them under five headings: core LLM skills like planning and tool use, application-specific benchmarks, generalist agents, benchmark dimensions, and developer frameworks and tools. It notes a trend toward more realistic and continuously updated tests while flagging gaps in cost, safety, and robustness assessment. That structure gives a readable map of the area without introducing new experiments or proofs. The synthesis itself is the main contribution, and it can help someone new to agent work see how the pieces fit.

The soft spot is the claim of being the first comprehensive survey. That rests on the authors having covered the literature without major omissions or selection bias, yet the abstract supplies no search protocol, databases, date ranges, or inclusion criteria. If the full text lacks a clear methods section spelling those out, the reported trends and gaps could reflect incomplete coverage rather than the actual state of the field. No equations or fitted models appear, so there is no circularity or parameter issue to worry about.

The paper targets readers who want an overview of agent benchmarks rather than specialists already tracking every new eval. It is the sort of organized review that can serve as a starting reference if the literature selection proves thorough. I would send it to peer review so referees can verify the coverage and suggest any missing papers; the topic is timely enough that a cleaned-up version would be worth having even with revisions.

Referee Report

2 major / 1 minor

Summary. The paper is a literature survey claiming to be the first comprehensive review of evaluation methods for LLM-based agents. It structures the analysis around five perspectives—(1) core LLM capabilities such as planning and tool use, (2) application-specific benchmarks (e.g., web and SWE agents), (3) generalist agents, (4) core dimensions of agent benchmarks, and (5) evaluation frameworks and tools—while identifying trends toward realistic, continuously updated benchmarks and gaps in cost-efficiency, safety, robustness, and fine-grained scalable methods.

Significance. If the literature selection and five-perspective categorization prove complete and unbiased, the survey would consolidate a rapidly evolving subfield and usefully highlight actionable gaps for future work on agent evaluation.

major comments (2)

[Abstract] Abstract: the central claim that this is the 'first comprehensive survey' is load-bearing for the paper's contribution, yet the abstract (and by extension the manuscript framing) provides no details on search methodology, databases queried, inclusion/exclusion criteria, date cutoffs, or explicit comparison against prior surveys on agent or LLM evaluation; without this, the identified trends and gaps cannot be verified as exhaustive rather than artifacts of selection.
[Five perspectives] Five-perspective structure: the mapping of the literature onto the chosen five perspectives requires explicit justification and a discussion of boundary cases; otherwise it is unclear whether important evaluation dimensions (e.g., multi-agent interaction or long-horizon safety) fall outside the frame and thereby affect the completeness of the gap analysis.

minor comments (1)

Add a dedicated methods subsection (or appendix) that reports the PRISMA-style flow or equivalent, keyword strings, and total papers screened/included so readers can assess coverage.
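The bookkeeping behind such a flow is simple arithmetic, sketched below. This is an illustration of the reporting the referee asks for, not anything taken from the paper: the `ScreeningFlow` class and its field names are hypothetical, the stage names follow the usual identification, screening, and full-text sequence, and the numbers in the demo are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningFlow:
    """Counts for a PRISMA-style flow. Field names are hypothetical."""
    identified: int             # records returned by all searches
    duplicates_removed: int     # removed before screening
    excluded_on_abstract: int   # dropped at title/abstract screening
    excluded_on_full_text: int  # dropped after reading the full text

    def __post_init__(self) -> None:
        counts = (
            self.identified,
            self.duplicates_removed,
            self.excluded_on_abstract,
            self.excluded_on_full_text,
        )
        if any(c < 0 for c in counts):
            raise ValueError("counts must be non-negative")
        if self.included < 0:
            raise ValueError("exclusions exceed the number of identified records")

    @property
    def screened(self) -> int:
        return self.identified - self.duplicates_removed

    @property
    def assessed_full_text(self) -> int:
        return self.screened - self.excluded_on_abstract

    @property
    def included(self) -> int:
        return self.assessed_full_text - self.excluded_on_full_text

    def summary(self) -> str:
        """Return one line per stage, ready to paste into a methods section."""
        return "\n".join([
            f"Records identified: {self.identified}",
            f"Records screened after deduplication: {self.screened}",
            f"Full texts assessed: {self.assessed_full_text}",
            f"Papers included: {self.included}",
        ])


if __name__ == "__main__":
    # Placeholder numbers, not the survey's actual counts.
    print(ScreeningFlow(500, 80, 300, 40).summary())
```

Publishing these four numbers alongside the keyword strings lets a reader judge how much of the candidate literature the final selection represents.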

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional methodological details and justifications.

Referee: [Abstract] Abstract: the central claim that this is the 'first comprehensive survey' is load-bearing for the paper's contribution, yet the abstract (and by extension the manuscript framing) provides no details on search methodology, databases queried, inclusion/exclusion criteria, date cutoffs, or explicit comparison against prior surveys on agent or LLM evaluation; without this, the identified trends and gaps cannot be verified as exhaustive rather than artifacts of selection.

Authors: We agree that explicit details on the literature search process are required to substantiate the claim of a first comprehensive survey. In the revised manuscript we will add a dedicated 'Survey Methodology' subsection (in the Introduction) that specifies the databases queried (arXiv, Google Scholar, ACL Anthology), search queries and keywords, date range, inclusion/exclusion criteria, and a direct comparison with prior surveys on LLM or agent evaluation. This addition will allow verification that the identified trends and gaps are not selection artifacts.

revision: yes

Referee: [Five perspectives] Five-perspective structure: the mapping of the literature onto the chosen five perspectives requires explicit justification and a discussion of boundary cases; otherwise it is unclear whether important evaluation dimensions (e.g., multi-agent interaction or long-horizon safety) fall outside the frame and thereby affect the completeness of the gap analysis.

Authors: We accept that the five-perspective categorization needs explicit justification and boundary-case discussion. The revision will include an expanded paragraph (in Section 1) explaining the derivation of the five perspectives from an initial literature scan, their coverage of core capabilities through frameworks, and how boundary topics such as multi-agent interaction and long-horizon safety are either subsumed under 'generalist agents' or 'core dimensions' or flagged as open gaps requiring future work. This will clarify the frame's completeness.

revision: yes

Circularity Check

0 steps flagged

No circularity: literature survey with no derivations or self-referential reductions

This is a survey paper with no equations, fitted parameters, predictions, or mathematical derivations. The central claim of providing the first comprehensive survey across five perspectives rests on literature selection and categorization, but the audit criteria require explicit quotes showing reduction by construction (e.g., self-definition or fitted input renamed as prediction). No such steps exist. The paper is self-contained as a review and does not invoke uniqueness theorems, ansatzes, or self-citations in a load-bearing circular manner. Score 0 is the appropriate default for honest non-findings in survey work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the reviewed literature is representative and that the authors' synthesis is accurate.

axioms (1)

Domain assumption: The five perspectives comprehensively cover the field of LLM agent evaluation.
Invoked by structuring the entire analysis around these categories.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
cs.SE 2026-04 unverdicted novelty 8.0

CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
cs.CY 2026-04 accept novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that be...
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
cs.SE 2026-01 accept novelty 8.0

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis
cs.CV 2026-04 unverdicted novelty 7.0

Agentic LLMs autonomously execute complex neuro-radiological workflows like glioma segmentation and multi-timepoint response assessment by directing off-the-shelf tools, without any model training.
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
cs.AI 2026-04 conditional novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
cs.AI 2026-04 unverdicted novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
cs.AI 2026-02 unverdicted novelty 7.0

Frontier LLMs display emerging investigatory agency in autonomous database analysis but struggle with long-horizon exploration on the new DDR-Bench.
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
cs.SE 2026-01 unverdicted novelty 7.0

MCP-Atlas introduces a benchmark of 36 real MCP servers, 220 tools, and 1,000 natural-language tasks to measure LLM tool-use competency in multi-server workflows.
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
cs.CL 2026-05 unverdicted novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

AgentAtlas defines a six-state control taxonomy and nine-category failure taxonomy, then shows that removing explicit label menus from prompts drops trajectory accuracy 14-40 points to a 0.54-0.62 floor across eight models.
The Scaling Laws of Skills in LLM Agent Systems
cs.CL 2026-05 unverdicted novelty 6.0

Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations...
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
cs.AI 2026-05 unverdicted novelty 6.0

A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...
Agentic AI-Enabled Framework for Thermal Comfort and Building Energy Assessment in Tropical Urban Neighborhoods
cs.MA 2026-04 unverdicted novelty 6.0

The study introduces an agentic AI framework integrating LLMs with lightweight physics models to evaluate thermal comfort and building energy in tropical urban neighborhoods.
Diagnosing CFG Interpretation in LLMs
cs.AI 2026-04 unverdicted novelty 6.0

LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
cs.AI 2026-04 unverdicted novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
cs.CR 2026-02 unverdicted novelty 6.0

The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
cs.AI 2025-07 unverdicted novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...
Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
cs.SE 2026-04 unverdicted novelty 5.0

Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.
Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
cs.CL 2026-04 unverdicted novelty 5.0

In real human subjects, AI transparency impacts imperfectly cooperative interactions far more than personality traits, unlike simulations where both are comparably influential.
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
cs.AI 2026-04 unverdicted novelty 5.0

AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments
cs.HC 2026-03 unverdicted novelty 5.0

The paper introduces an agentic AI platform to train and support recovered soldiers as peer facilitators providing mental health triage and interventions in austere battlefield environments.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI 2025-08 unverdicted novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows
cs.AI 2026-04 unverdicted novelty 4.0

CMBAgent achieves high accuracy on well-specified astrophysical tasks with context but generates silent, plausible-yet-incorrect outputs on reasoning-challenging problems, with no self-diagnosis of inconsistencies.
Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
cs.AI 2026-04 unverdicted novelty 4.0

A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
cs.CL 2026-03 unverdicted novelty 4.0

Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains
cs.AI 2026-04 unverdicted novelty 3.0

Flowr is an agentic AI framework that decomposes retail supply chain workflows into coordinated LLM-based agents with human-in-the-loop oversight to automate operations in large supermarket chains.
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
cs.SE 2026-02 unverdicted novelty 3.0

A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 30 Pith papers · 5 internal anchors

[1]

ArXiv, abs/2410.06992

Swe-bench+: Enhanced coding benchmark for llms. ArXiv, abs/2410.06992. Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Perc...

[2]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. Preprint, arXiv:2105.03011. Databricks. 2023. Mosaic ai agent evaluation: Assessing ai application ...

[3]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu S...

[4]

Preprint, arXiv:2501.13121

Episodic memories generation and evaluation benchmark for large language models. Preprint, arXiv:2501.13121. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. Preprint,...

[5]

GAIA: a benchmark for General AI Assistants

Gaia: a benchmark for general ai assistants. Preprint, arXiv:2311.12983. Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. 2025. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering? arXiv preprint arXiv:2502.12115. Niels Mündler, Mark Müller, Jingxuan He, and Martin Vechev. 2024. Swt-benc...

[6]

MemGPT: Towards LLMs as Operating Systems

Memgpt: Towards llms as operating systems. Preprint, arXiv:2310.08560. Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. Preprint, arXiv:2203.14371. Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, and Valerie Chen. 2025. When b...

[7]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Reflexion: language agents with verbal reinforcement learning. In Neural Information Processing Systems. Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. Alfworld: Aligning text and embodied environments for interactive learning. Preprint, arXiv:2010.03768. Chenglei Si, Diyi Yang, and Tatsun...

[8]

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16022–16076, Bangkok, Thailand

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16022–16076, Bangkok, Thailand. Association for Computational Linguistics. Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedha...

[9]

Advances in Neural Information Processing Systems, 36:38975–38987

Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. Advances in Neural Information Processing Systems, 36:38975–38987. David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishi...

[10]

Preprint, arXiv:2406.07835

Sciriff: A resource to enhance language model instruction-following over scientific literature. Preprint, arXiv:2406.07835. Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024a. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345...

[11]

2407.15711 , archivePrefix=

Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822. Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. 2024. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711. Jiaxu...

[12]

arXiv preprint arXiv:2404.09992

Mmina: Benchmarking multihop multimodal internet agents. arXiv preprint arXiv:2404.09992. Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. 2024. Natural plan: Benchmarking llms on natural language planning. arXiv preprint arXiv:2406.04520. Lucen Zhong, Zhengxiao D...

