Recognition: 2 theorem links · Lean Theorem
Holistic Evaluation and Failure Diagnosis of AI Agents
Pith reviewed 2026-05-15 03:14 UTC · model grok-4.3
The pith
Decomposing AI agent traces into independent spans enables precise failure diagnosis and higher diagnostic accuracy than monolithic full-trace judging.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our framework pairs top-down agent-level diagnosis with bottom-up span-level evaluation by decomposing analysis into independent per-span assessments. This produces span-level rationales and scales to arbitrary trace lengths. On the TRAIL benchmark it achieves state-of-the-art results across all metrics on GAIA and SWE-Bench, with relative gains up to 38% on category F1, 3.5x on localization accuracy, and 12.5x on joint localization-categorization accuracy. The same frontier model achieves several times higher localization accuracy inside the framework than when used as a monolithic judge over the full trace.
What carries the argument
The decomposition of agent traces into independent per-span assessments for separate evaluation and rationale generation.
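As a rough sketch of what this decomposition could look like in code (not the authors' implementation): each span is serialized on its own and sent to a judge model for an independent verdict plus rationale, so per-call context stays bounded regardless of trace length. The Span fields, call_judge, and the error taxonomy below are hypothetical placeholders.

from dataclasses import dataclass

# Hypothetical error taxonomy; the paper's real categories follow the TRAIL benchmark.
ERROR_CATEGORIES = ["planning", "tool_use", "reasoning", "output", "none"]

@dataclass
class Span:
    span_id: str
    content: str  # serialized input/output of one agent step

@dataclass
class SpanVerdict:
    span_id: str
    category: str   # one of ERROR_CATEGORIES
    rationale: str  # span-level rationale for the verdict

def call_judge(prompt: str) -> tuple[str, str]:
    """Placeholder for an LLM call returning (category, rationale)."""
    raise NotImplementedError

def evaluate_trace(spans: list[Span]) -> list[SpanVerdict]:
    """Judge every span independently: each prompt sees a single span,
    so per-call context stays bounded no matter how long the trace is."""
    verdicts = []
    for span in spans:
        prompt = (
            "Assess this single agent step for errors.\n"
            f"Allowed categories: {', '.join(ERROR_CATEGORIES)}\n"
            f"Step:\n{span.content}"
        )
        category, rationale = call_judge(prompt)
        verdicts.append(SpanVerdict(span.span_id, category, rationale))
    return verdicts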
Load-bearing premise
That the causes of failures in agent traces can be accurately identified by assessing independent spans without needing the full surrounding context for interdependent steps.
What would settle it
A direct comparison showing that a monolithic full-trace judge achieves comparable or better localization and categorization accuracy on the TRAIL benchmark would falsify the advantage of span decomposition.
Original abstract
AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
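The abstract's three headline metrics can be read as scoring predicted errors against gold (span, category) annotations: localization accuracy ignores the category, joint accuracy requires both span and category to match, and category F1 compares the sets of categories flagged per trace. The sketch below is one plausible reading of those definitions, not the TRAIL benchmark's official scorer.

def localization_accuracy(pred: set[tuple[str, str]], gold: set[tuple[str, str]]) -> float:
    """Fraction of gold errors whose span is located, regardless of category."""
    if not gold:
        return 1.0
    pred_spans = {span_id for span_id, _ in pred}
    return sum(1 for span_id, _ in gold if span_id in pred_spans) / len(gold)

def joint_accuracy(pred: set[tuple[str, str]], gold: set[tuple[str, str]]) -> float:
    """Fraction of gold errors matched on both span and category."""
    if not gold:
        return 1.0
    return len(pred & gold) / len(gold)

def category_f1(pred: set[tuple[str, str]], gold: set[tuple[str, str]]) -> float:
    """F1 over the sets of error categories flagged in a trace, ignoring location."""
    pred_cats = {cat for _, cat in pred}
    gold_cats = {cat for _, cat in gold}
    tp = len(pred_cats & gold_cats)
    precision = tp / len(pred_cats) if pred_cats else 0.0
    recall = tp / len(gold_cats) if gold_cats else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0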
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a holistic evaluation framework for AI agents that pairs top-down agent-level diagnosis with bottom-up span-level evaluation by decomposing long traces into independent per-span assessments. It reports state-of-the-art results on the TRAIL benchmark across GAIA and SWE-Bench, with relative gains over prior baselines of up to 38% on category F1, 3.5x on localization accuracy, and 12.5x on joint localization-categorization accuracy, and concludes that evaluation methodology rather than model capability is the primary bottleneck.
Significance. If the span-decomposition approach can be shown to recover accurate failure diagnoses even for interdependent errors, the framework would offer a scalable, interpretable method for diagnosing AI agent failures that could accelerate targeted improvements. The reported result that the same frontier model yields substantially higher localization accuracy inside the framework than as a monolithic judge would strengthen the case for investing in structured evaluation designs over raw capability scaling.
major comments (2)
- [Abstract] The central claim that per-span assessments produce accurate failure diagnoses (underpinning the 12.5x joint accuracy gain) assumes spans can be treated as independent. No validation is provided that these verdicts match full-trace human judgments on chained cases, such as an early planning error that invalidates subsequent tool calls, which are prevalent in GAIA and SWE-Bench traces.
- [Evaluation] The manuscript provides no details on data splits, span-boundary criteria, or error analysis stratified by interdependence, leaving open whether the reported SOTA gains are robust or specific to the chosen benchmarks and trace distributions.
minor comments (1)
- The abstract and results paragraphs would benefit from explicit statements of the number of traces evaluated and the statistical significance of the reported relative gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional validation and methodological transparency would strengthen the paper. We address each major comment below and will revise the manuscript accordingly to incorporate the suggested clarifications and analyses.
Point-by-point responses
- Referee: [Abstract] The central claim that per-span assessments produce accurate failure diagnoses (underpinning the 12.5x joint accuracy gain) assumes spans can be treated as independent. No validation is provided that these verdicts match full-trace human judgments on chained cases, such as an early planning error that invalidates subsequent tool calls, which are prevalent in GAIA and SWE-Bench traces.
  Authors: We agree that explicit validation of the independence assumption on interdependent (chained) errors is a valuable addition. The framework's design intentionally decomposes traces to enable scalable per-span judgments, but we recognize that direct comparison to full-trace human judgments on cases with early planning errors would better substantiate the reported gains. In the revised manuscript, we will add a dedicated analysis subsection that selects a stratified subset of GAIA and SWE-Bench traces exhibiting clear interdependencies, obtains full-trace human annotations, and reports agreement metrics between span-level verdicts and holistic judgments. This will include quantitative results on how early errors propagate and whether per-span evaluation still recovers accurate diagnoses. Revision: yes.
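The proposed agreement analysis could be computed roughly as below, comparing per-span framework verdicts against full-trace human labels on the chained-error subset: raw agreement plus Cohen's kappa as a chance-corrected check. Function names and the label encoding are assumptions, not the authors' protocol; both functions assume at least one span is labeled by both sources.

from collections import Counter

def raw_agreement(framework: dict[str, str], human: dict[str, str]) -> float:
    """Per-span category agreement over spans labeled by both sources."""
    shared = set(framework) & set(human)
    return sum(framework[s] == human[s] for s in shared) / len(shared)

def cohens_kappa(framework: dict[str, str], human: dict[str, str]) -> float:
    """Chance-corrected agreement, useful when a single label (e.g. 'none')
    dominates the chained-error subset."""
    shared = sorted(set(framework) & set(human))
    n = len(shared)
    p_obs = sum(framework[s] == human[s] for s in shared) / n
    counts_f = Counter(framework[s] for s in shared)
    counts_h = Counter(human[s] for s in shared)
    p_exp = sum(counts_f[c] * counts_h[c] for c in counts_f) / (n * n)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)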
- Referee: [Evaluation] The manuscript provides no details on data splits, span-boundary criteria, or error analysis stratified by interdependence, leaving open whether the reported SOTA gains are robust or specific to the chosen benchmarks and trace distributions.
  Authors: We will expand the Evaluation section with the requested details. Data splits will be specified (e.g., the exact partitioning or sampling procedure used for the TRAIL benchmark traces from GAIA and SWE-Bench). Span-boundary criteria will be formalized (e.g., boundaries defined at the level of individual reasoning steps, tool invocations, or sub-task completions, with examples). We will also add an error analysis stratified by interdependence, categorizing traces into low-, medium-, and high-interdependence groups based on manual review and reporting per-group localization and categorization accuracy. These additions will demonstrate that the SOTA results hold across varying trace characteristics. Revision: yes.
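A sketch of the promised stratified analysis, assuming a manual 1-5 interdependence rating per trace and an evaluate callback returning per-trace localization or categorization accuracy; the bucket thresholds and names are illustrative only.

from statistics import mean

def stratify_by_interdependence(trace_ids, rating, low_max=2, high_min=4):
    """Bucket traces into low/medium/high interdependence from a manual rating."""
    buckets = {"low": [], "medium": [], "high": []}
    for trace_id in trace_ids:
        r = rating[trace_id]
        key = "low" if r <= low_max else "high" if r >= high_min else "medium"
        buckets[key].append(trace_id)
    return buckets

def per_group_accuracy(buckets, evaluate):
    """Mean accuracy per bucket; a sharp drop in the high bucket would
    suggest the independence assumption is doing real damage."""
    return {name: (mean(evaluate(t) for t in group) if group else None)
            for name, group in buckets.items()}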
Circularity Check
No significant circularity; results benchmarked on external GAIA/SWE-Bench
Full rationale
The paper's central claims consist of empirical performance gains (up to 38% F1, 3.5x localization, 12.5x joint accuracy) measured on the independent external benchmarks GAIA and SWE-Bench via the TRAIL evaluation. No equations, fitted parameters, or self-citations are shown to define the reported metrics by construction. The span-decomposition methodology is introduced as an engineering choice whose value is assessed through those external outcomes rather than tautologically derived from prior author work. This places the work in the normal non-circular range (0-2).
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agent execution traces can be decomposed into independent spans suitable for separate failure assessment.
invented entities (1)
- span-level assessment (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "decomposing analysis into independent per-span assessments... scales to traces of arbitrary length"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "bottom-up evaluation computes metrics at leaf spans... hierarchical aggregation"
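The second quoted passage describes computing verdicts at leaf spans and aggregating them up the span hierarchy. A minimal sketch of one such roll-up (counting error categories per subtree) follows; the tree structure and the aggregation rule are assumptions, not the paper's definition.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpanNode:
    span_id: str
    children: list["SpanNode"] = field(default_factory=list)
    verdict: Optional[str] = None  # leaf verdict: an error category or "none"

def aggregate(node: SpanNode) -> dict[str, int]:
    """Bottom-up roll-up: leaves contribute their own verdicts, internal
    nodes summarize their subtrees by summing per-category error counts."""
    if not node.children:
        return {} if node.verdict in (None, "none") else {node.verdict: 1}
    totals: dict[str, int] = {}
    for child in node.children:
        for category, count in aggregate(child).items():
            totals[category] = totals.get(category, 0) + count
    return totals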
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] S. Barke et al. Agentrx: Diagnosing and repairing LLM agent failures. arXiv preprint arXiv:2602.02475, 2026.
- [2] S. Bhonsle et al. Auto-eval judge: Automated evaluation of LLM systems. arXiv preprint arXiv:2508.05508, 2025.
- [3] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025.
- [4] Z. Chen et al. T-eval: Evaluating tool-augmented language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024. arXiv:2312.14033.
- [5] CrewAI. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://www.crewai.com, 2024.
- [6] S. Datta et al. Agent GPA: A general framework for evaluating LLM agents. arXiv preprint arXiv:2510.08847, 2025.
- [7] Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, and Rebecca Qian. TRAIL: Trace reasoning and agentic issue localization. arXiv preprint arXiv:2505.08638, 2025.
- [8]
- [9] Google. Agent Development Kit (ADK). https://google.github.io/adk-docs/, 2025.
- [10] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.
- [11] A. Kartik et al. AgentCompass: Benchmarking and evaluating LLM agents. arXiv preprint arXiv:2509.14647, 2025.
- [12] LangChain. LangGraph. https://www.langchain.com/langgraph, 2024.
- [13]
- [14] Yifan Li et al. TRACE: Trajectory-aware comprehensive evaluation for deep research agents. arXiv preprint arXiv:2602.21230, 2026.
- [15] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2023.
- [16] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024.
- [17] Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stanczak, Peter Shaw, Christopher J. Pal, and Siva Reddy. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories. In Conference on Language Modeling (COLM), 2025.
- [18]
- [19] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [20] OpenTelemetry Authors. Semantic conventions for generative AI systems. https://opentelemetry.io/docs/specs/semconv/gen-ai/, 2024.
- [21]
- [22]
- [23] Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. Spotlight.
- [24] Y. Zhang et al. Agentracer: Who is inducing failure in the LLM agentic systems? arXiv preprint arXiv:2509.03312, 2025.
- [25] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, 2023.
- [26] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [27] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-judge: Evaluate agents with agents. In International Conference on Machine Learning, 2025.
- [28] Emily Midkiff. "Dragons Are Tricksy": The Uncanny Dragons of Children's Literature. Fafnir 2/2014. http://journal.finfar.org/articles/127.pdf