pith. machine review for the scientific record.

arxiv: 2604.16706 · v1 · submitted 2026-04-17 · 💻 cs.AI · cs.CL · cs.MA

Recognition: unknown

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

Bhaskar Gurram


Pith reviewed 2026-05-10 07:58 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords LLM agents · tool use · evaluation reliability · error propagation · runtime mitigation · human validation · hallucination · benchmark

The pith

Human-validated checks show that substring-based automated judges for tool-using LLM agents perform only at chance level, and injected parameter errors cascade to wrong final answers about 62 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common assumption that automated evaluation can reliably assess tool-using language model agents by creating a benchmark with human labels on a 100-trace subset. It reports that simple substring matching agrees with humans only at chance level while a three-LLM ensemble reaches moderate agreement but with a conservative bias. The work also measures how often an injected bad tool parameter cascades to an incorrect final answer once calibrated against the human judgments, and it tests whether a runtime interceptor can catch or correct such errors before they propagate. A sympathetic reader cares because many current agent benchmarks and safety claims rest on unvalidated automated scoring that may overstate real performance.
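
As a rough illustration of the two judging styles under comparison, the sketch below contrasts a substring check with a majority vote over several judge calls. The function signatures and the judge callables are assumptions made for illustration; the released benchmark code may structure both quite differently.

```python
# Illustrative sketch only: a substring judge vs. an ensemble (majority-vote) judge.
# The signatures and judge callables are assumptions, not the paper's API.
from collections import Counter
from typing import Callable

def substring_judge(final_answer: str, gold_answer: str) -> bool:
    """Mark the trace correct if the gold answer appears verbatim in the output."""
    return gold_answer.strip().lower() in final_answer.strip().lower()

def ensemble_judge(final_answer: str, gold_answer: str,
                   judges: list[Callable[[str, str], bool]]) -> bool:
    """Majority vote over independent judge calls (e.g. three LLM judges)."""
    votes = [judge(final_answer, gold_answer) for judge in judges]
    return Counter(votes).most_common(1)[0][0]
```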

Core claim

Using the AgentProp-Bench of 2,300 traces across nine models and four domains together with its 100 human-validated labels, the authors find substring-based judging agrees with humans at kappa=0.049 while a three-LLM ensemble reaches kappa=0.432 with conservative bias. Parameter-level injection propagates to a wrong final answer at a human-calibrated probability of approximately 0.62. Rejection of bad parameters and recovery after acceptance are independent model capabilities. A tuned runtime interceptor reduces hallucination by 23 percentage points on GPT-4o-mini under controlled conditions but produces no significant change on Gemini-2.0-Flash.
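
The kappa figures are plain agreement statistics; a minimal sketch of how they could be recomputed from the released labels is below, assuming two aligned lists of binary verdicts. The example values are placeholders, not the paper's data.

```python
# Minimal sketch: Cohen's kappa between an automated judge and human labels.
# The verdict lists are placeholders; the real computation would read the
# released 100-label human-validated subset.
from sklearn.metrics import cohen_kappa_score

human_verdicts = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # human ground truth (illustrative)
judge_verdicts = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]  # automated judge on the same traces

print(f"kappa = {cohen_kappa_score(human_verdicts, judge_verdicts):.3f}")
```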

What carries the argument

The 100-trace human-validated subset of AgentProp-Bench, which calibrates judge reliability and supplies ground-truth probabilities for measuring how parameter errors propagate through agent traces to final answers.
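
Given human-validated correctness labels on traces with an injected bad parameter, the propagation probability reduces to a conditional frequency. A sketch under assumed field names (the paper's exact calibration may differ):

```python
# Sketch: human-calibrated propagation probability = fraction of traces with an
# injected bad parameter whose final answer is judged wrong.
# Field names are assumptions for illustration.
def propagation_rate(traces: list[dict]) -> float:
    injected = [t for t in traces if t.get("injected_bad_parameter")]
    if not injected:
        return float("nan")
    wrong = sum(1 for t in injected if not t["final_answer_correct"])
    return wrong / len(injected)

# e.g. 62 wrong out of 100 injected traces -> 0.62
example = [{"injected_bad_parameter": True, "final_answer_correct": i >= 62}
           for i in range(100)]
print(propagation_rate(example))  # 0.62
```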

If this is right

  • Substring-based automated judges should not be trusted for agent evaluation because they perform at chance level against humans.
  • Ensemble LLM judges offer moderate reliability but still require bias correction when used at scale.
  • Models must be trained separately on parameter rejection and on recovery because the two skills do not correlate.
  • Runtime interception provides a practical mitigation for hallucination in models that accept bad parameters but is unnecessary for models that already reject aggressively.
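
The interceptor in the last point is, in outline, a guard between the agent's proposed tool call and its execution. A minimal sketch follows; the validation and repair hooks are hypothetical stand-ins, not the paper's tuned interceptor.

```python
# Minimal sketch of runtime interception: validate (and optionally repair) tool
# parameters before executing the call. The hooks are hypothetical stand-ins.
from typing import Any, Callable, Optional

def intercept_tool_call(tool: Callable[..., Any],
                        params: dict,
                        validate: Callable[[dict], bool],
                        repair: Callable[[dict], Optional[dict]]) -> Any:
    if validate(params):
        return tool(**params)
    repaired = repair(params)
    if repaired is not None and validate(repaired):
        return tool(**repaired)
    raise ValueError("tool call rejected: parameters failed validation")
```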

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current agent benchmarks that rely solely on automated judges likely overstate performance because those judges have low human agreement.
  • Agent designs should embed parameter validation before any tool call rather than relying only on post-hoc recovery.
  • The model-specific effectiveness of the interceptor suggests that future work should map which architectures naturally avoid the target failure mode.
  • Extending human validation to the remaining traces would allow tighter confidence intervals on the reported propagation rates.

Load-bearing premise

The 100 human-labeled traces are representative of the full 2,300 traces for calibrating judge agreement and error propagation rates.

What would settle it

A fresh human annotation of a substantially larger random sample of traces that produces a propagation probability outside the 0.46-0.73 range or judge kappas differing markedly from 0.049 and 0.432.
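
The 95% bootstrap intervals referenced in the figure captions can be reproduced in outline as below; a larger annotation effort would simply feed more paired labels through the same resampling. A sketch, not the paper's script:

```python
# Sketch: percentile bootstrap CI for an agreement statistic over paired labels.
import random
from sklearn.metrics import cohen_kappa_score

def bootstrap_ci(human, judge, stat=cohen_kappa_score, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    n = len(human)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(stat([human[i] for i in idx], [judge[i] for i in idx]))
    draws.sort()
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot)]
```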

Figures

Figures reproduced from arXiv: 2604.16706 by Bhaskar Gurram.

Figure 1. Per-model stage-reach rates under P2 semantic-wrong injection (ensemble-judged, …).
Figure 2. Rejection rate (1 − p_S1) vs. recovery rate (1 − r_{2,3}) for nine models. Point size proportional to n_S2. Spearman ρ=0.126, p=0.747: the axes are statistically independent.
Figure 3. Interceptor effect under the concurrent n=600 control. GPT-4o-mini shows a significant 23 pp reduction; Gemini-2.0-Flash shows no significant effect. Error bars: 95% bootstrap CIs.
Figure 4. Cohen's κ vs. 100 human labels for each judge method. The heuristic is at chance level (κ=0.049); the ensemble reaches moderate agreement (κ=0.432). Error bars: 95% bootstrap CIs.
original abstract

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300 traces across four domains, nine production LLMs, and a 100-label human-validated subset. We quantify judge reliability, characterize error propagation, and evaluate a runtime mitigation. Substring-based judging agrees with human annotation at kappa=0.049 (chance-level); a three-LLM ensemble reaches kappa=0.432 (moderate) with a conservative bias. Under validated evaluation, a parameter-level injection propagates to a wrong final answer with human-calibrated probability approximately 0.62 (range 0.46-0.73 across models). Rejection (catching bad parameters) and recovery (correcting after acceptance) are independent model capabilities (Spearman rho=0.126, p=0.747). A tuned runtime interceptor reduces hallucination on GPT-4o-mini by 23.0 percentage points under a concurrent n=600 control, but shows no significant effect on Gemini-2.0-Flash, whose aggressive parameter rejection eliminates the target failure mode. All code, data, traces, and human labels are released at https://github.com/bhaskargurram-ai/agenthallu-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AgentProp-Bench, a benchmark of 2,300 traces from tool-using LLM agents across nine production models and four domains, including a 100-label human-validated subset. It quantifies automated judge reliability (substring-based kappa=0.049; three-LLM ensemble kappa=0.432 with conservative bias), measures error propagation from parameter-level injections to incorrect final answers (human-calibrated probability ~0.62, range 0.46-0.73), shows that rejection and recovery are statistically independent capabilities (Spearman rho=0.126, p=0.747), and evaluates a tuned runtime interceptor that reduces hallucination by 23.0 percentage points on GPT-4o-mini under a concurrent n=600 control (no significant effect on Gemini-2.0-Flash). All code, data, traces, and human labels are released publicly.

Significance. If the empirical findings hold, the work supplies concrete, reproducible measurements of judge reliability and error propagation in agentic tool use, along with a practical runtime mitigation whose differential effects across models are documented. The public release of the full dataset and labels is a clear strength that enables independent verification and extension.

major comments (2)
  1. [Abstract / human annotation section] The description of the 100-label human-validated subset (Abstract and the human annotation section) provides no selection criteria, sampling method, or balance statistics across domains, models, or error types. Because the reported kappa values and the human-calibrated propagation probability of ~0.62 are obtained by extrapolating from this subset to the full 2,300 traces, the absence of explicit stratification or random-sampling documentation makes the representativeness assumption unverified and load-bearing for the central quantitative claims.
  2. [Mitigation experiment results] The table or figure reporting the mitigation results (the 23.0 pp reduction on GPT-4o-mini and the null result on Gemini-2.0-Flash) does not state whether the concurrent n=600 control was applied identically to both models or whether the interceptor tuning was performed on the same data partition used for the propagation analysis; this detail is required to interpret the model-specific outcomes.
minor comments (2)
  1. [Abstract] The benchmark is described as both a '2,000-task benchmark' and containing '2,300 traces'; a brief clarification of the relationship between tasks and traces would improve precision.
  2. All reported probabilities and kappa values should be accompanied by confidence intervals or standard errors in the main text and tables to allow readers to assess precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify key aspects of our experimental design and reporting. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.

point-by-point responses
  1. Referee: [Abstract / human annotation section] The description of the 100-label human-validated subset (Abstract and the human annotation section) provides no selection criteria, sampling method, or balance statistics across domains, models, or error types. Because the reported kappa values and the human-calibrated propagation probability of ~0.62 are obtained by extrapolating from this subset to the full 2,300 traces, the absence of explicit stratification or random-sampling documentation makes the representativeness assumption unverified and load-bearing for the central quantitative claims.

    Authors: We agree that explicit documentation of the sampling procedure is necessary to support the extrapolation from the 100-label subset. The subset was constructed via stratified random sampling across the four domains and nine models, with inclusion criteria limited to traces containing at least one tool call and a final answer; we will add a new subsection in the human annotation section that reports the exact sampling method, inclusion criteria, and balance statistics (e.g., counts per domain and per model). These details were computed during data collection and will be included in the revision to make the representativeness assumption verifiable. revision: yes

  2. Referee: [Mitigation experiment results] Table or figure reporting the mitigation results (the 23.0 pp reduction on GPT-4o-mini and null result on Gemini-2.0-Flash) does not state whether the concurrent n=600 control was applied identically to both models or whether the interceptor tuning was performed on the same data partition used for the propagation analysis; this detail is required to interpret the model-specific outcomes.

    Authors: We acknowledge the need for these experimental controls to be stated explicitly. The n=600 concurrent control was applied identically to both models using the same data-collection protocol and time window. The interceptor was tuned on a held-out partition that does not overlap with the traces used for the propagation analysis. In the revision we will add these statements to the mitigation-experiment subsection and include a clarifying sentence in the relevant table caption. revision: yes
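
The two responses above hinge on routine machinery: a stratified random draw across domain × model strata for the human-labeled subset, and a task-level split so interceptor tuning never touches the propagation-analysis traces. A hedged sketch of both, with illustrative field names and quotas that are assumptions, not the paper's protocol:

```python
# Sketches of the sampling and partitioning described in the rebuttal.
# Field names, inclusion flags, and quotas are illustrative assumptions.
import random
from collections import defaultdict

def stratified_sample(traces: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to per_stratum traces from each (domain, model) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        if t.get("has_tool_call") and t.get("has_final_answer"):  # inclusion criteria
            strata[(t["domain"], t["model"])].append(t)
    sample = []
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

def split_by_task(traces: list[dict], tune_fraction: float = 0.2, seed: int = 0):
    """Disjoint split by task id: interceptor tuning vs. propagation analysis."""
    task_ids = sorted({t["task_id"] for t in traces})
    random.Random(seed).shuffle(task_ids)
    cut = int(len(task_ids) * tune_fraction)
    tune_ids = set(task_ids[:cut])
    tune = [t for t in traces if t["task_id"] in tune_ids]
    analysis = [t for t in traces if t["task_id"] not in tune_ids]
    return tune, analysis
```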

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

full rationale

The paper reports direct empirical results: kappa agreements between automated judges and human labels on a 100-label subset, human-calibrated propagation probabilities (~0.62) computed from observed traces, Spearman correlations between rejection/recovery capabilities, and measured percentage-point reductions from a runtime interceptor under controlled conditions. These quantities are obtained by counting, comparing, and averaging over the released traces and labels; they do not reduce to any equation, fitted parameter, or self-citation that is itself defined by the paper's outputs. No derivation chain, ansatz, uniqueness theorem, or renaming of known results is present. The representativeness of the 100-label subset is a validity assumption, not a circularity mechanism.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on human annotations serving as ground truth and on standard statistical assumptions for agreement metrics; no free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (2)
  • standard math Cohen's kappa measures inter-rater agreement beyond chance
    Used to quantify judge reliability against human labels
  • standard math Spearman rank correlation tests independence of model capabilities
    Applied to rejection and recovery behaviors
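
Both ledger entries correspond to off-the-shelf statistics; the independence claim, for instance, is a single Spearman test over nine per-model (rejection rate, recovery rate) pairs. A sketch with placeholder values, not the paper's measurements:

```python
# Sketch: Spearman rank correlation between per-model rejection and recovery rates.
# The nine value pairs below are placeholders, not the paper's measurements.
from scipy.stats import spearmanr

rejection_rate = [0.10, 0.35, 0.22, 0.80, 0.15, 0.40, 0.55, 0.28, 0.63]
recovery_rate  = [0.30, 0.25, 0.45, 0.20, 0.50, 0.38, 0.33, 0.42, 0.29]

rho, p_value = spearmanr(rejection_rate, recovery_rate)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```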

pith-pipeline@v0.9.0 · 5554 in / 1491 out tokens · 38413 ms · 2026-05-10T07:58:05.008731+00:00 · methodology

