pith. machine review for the scientific record.

arxiv: 2605.08715 · v2 · submitted 2026-05-09 · 💻 cs.CL · cs.AI · cs.MA

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Pith reviewed 2026-05-15 05:20 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA
keywords multi-agent systems · online auditing · failure prediction · LLM agents · reinforcement learning · trajectory analysis · early intervention · error localization

The pith

A 7B model audits unfolding multi-agent trajectories online to flag the earliest decisive error using only the current prefix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based multi-agent systems fail when one early mistake is accepted and cascades into full trajectory collapse. Existing approaches diagnose the responsible step and agent only after the run ends, losing any chance to intervene. This work reframes the problem as online auditing: at every step an auditor sees only the prefix so far and must either let the trajectory continue or raise an alarm at the first decisive error. The authors curate AFTraj-2K, with strictly filtered safe trajectories and step-level annotations on unsafe ones obtained by consensus among LLM judges. They then train AgentForesight-7B with a two-stage reinforcement-learning procedure that first installs a risk-anticipation prior on adjacent safe/unsafe prefixes and then sharpens localization with a three-axis reward covering the what, where, and who of each verdict. On AFTraj-2K and the external Who&When benchmark, the 7B auditor exceeds GPT-4.1 and DeepSeek-V4-Pro by up to 19.9 percent while cutting step-localization error by a factor of three.
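
To make that contract concrete, here is a minimal sketch of the prefix-only auditing loop the pith describes. The names `agents_step_fn` and `audit_prefix` are hypothetical stand-ins for the multi-agent system and the trained auditor; neither signature comes from the paper.

```python
# Minimal sketch of the prefix-only auditing contract described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    alarm: bool           # the "what": is there a decisive error?
    step: Optional[int]   # the "where": earliest decisive error step
    agent: Optional[str]  # the "who": agent responsible at that step

def run_with_auditor(agents_step_fn, audit_prefix, max_steps=50):
    """Advance the trajectory one step at a time, auditing each prefix."""
    prefix = []
    for _ in range(max_steps):
        prefix.append(agents_step_fn(prefix))  # next (agent, action, result)
        verdict = audit_prefix(prefix)         # sees only the steps so far
        if verdict.alarm:
            return prefix, verdict             # hand off to intervention
    return prefix, Verdict(alarm=False, step=None, agent=None)
```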

Core claim

AgentForesight-7B performs online auditing on agentic trajectories by observing only the current prefix and deciding at each step whether to continue or to alarm at the earliest decisive error. It is obtained by first training a risk-anticipation prior on adjacent safe/unsafe prefix pairs and then refining that prior with a three-axis reward that jointly targets the what, where, and who of the audit verdict, yielding up to 19.9 percent higher performance and three-fold lower step-localization error than leading proprietary models on AFTraj-2K and the Who&When benchmark.

What carries the argument

The coarse-to-fine reinforcement learning recipe that first equips the model with a risk-anticipation prior on adjacent safe/unsafe prefix pairs and then sharpens it into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict.
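
The paper's exact reward implementation is not given in this review, but a hedged illustration of how the three axes could compose is below; the terms, the weights `w_what`, `w_where`, `w_who`, and the step tolerance `tol` are assumed free parameters (echoing the ledger further down), not the published training signal.

```python
# Illustrative composition of a three-axis audit reward; all weights and
# the tolerance are assumptions, not the paper's settings.
def audit_reward(pred_alarm, pred_step, pred_agent,
                 gold_alarm, gold_step, gold_agent,
                 w_what=1.0, w_where=1.0, w_who=0.5, tol=2):
    # "what": was the continue/alarm decision itself correct?
    r = w_what * (1.0 if pred_alarm == gold_alarm else -1.0)
    if pred_alarm and gold_alarm:
        # "where": partial credit that decays with distance from the label
        r += w_where * max(0.0, 1.0 - abs(pred_step - gold_step) / (tol + 1))
        # "who": exact match on the responsible agent
        r += w_who * (1.0 if pred_agent == gold_agent else 0.0)
    return r
```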

If this is right

  • Multi-agent deployments can shift from post-hoc failure analysis to live intervention that stops trajectories before they collapse.
  • Auditing becomes possible at every step without access to future actions or outcomes.
  • A compact 7B model can surpass much larger proprietary systems on both detection accuracy and localization precision.
  • The same staged RL approach that first builds a boundary prior and then refines localization may apply to other sequential auditing tasks.
  • Trajectory-level success rates can improve by catching single decisive mistakes before downstream agents accept them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be embedded inside existing agent runtimes to trigger human review or automatic rollback at the moment of alarm.
  • Because the auditor never sees future steps, it may generalize to open-ended real-world tasks where ground-truth failure is revealed only after the fact.
  • Extending the three-axis reward to include downstream impact estimates could further reduce false alarms while preserving early detection.
  • The curation pipeline's reliance on LLM judges suggests that scaling judge diversity or adding human verification loops would be a natural next experiment.

Load-bearing premise

The step-level annotations produced by consensus among multiple LLM judges correctly identify the decisive error without hindsight, and the curation pipeline for safe trajectories produces data that generalizes to real deployments.

What would settle it

Run the trained auditor on a fresh set of held-out trajectories and measure whether it raises an alarm exactly at the human-identified first decisive error step and nowhere earlier, or whether it misses the error or produces false alarms.
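
One way to score such a held-out run is sketched below. Each record is assumed to be a pair of the model's alarm step and the human-labeled decisive step, with `None` meaning no alarm or a safe trajectory; the metric names are illustrative, not the paper's exact definitions.

```python
# Score a held-out set of (predicted_alarm_step, labeled_decisive_step) pairs.
def score(records):
    exact = misses = false_alarms = n_safe = n_unsafe = 0
    offsets = []
    for pred, gold in records:
        if gold is None:                      # safe trajectory
            n_safe += 1
            false_alarms += pred is not None  # any alarm here is spurious
        else:                                 # unsafe trajectory
            n_unsafe += 1
            if pred is None:
                misses += 1                   # decisive error never flagged
            else:
                exact += pred == gold         # alarm at exactly the labeled step
                offsets.append(abs(pred - gold))
    return {
        "step_accuracy": exact / max(n_unsafe, 1),
        "mean_step_offset": sum(offsets) / max(len(offsets), 1),
        "false_alarm_rate": false_alarms / max(n_safe, 1),
        "miss_rate": misses / max(n_unsafe, 1),
    }

print(score([(3, 3), (5, 2), (None, 4), (None, None), (1, None)]))
```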

Figures

Figures reproduced from arXiv: 2605.08715 by Boxuan Zhang, Dongfang Liu, Jianing Zhu, Ruixiang Tang, Zeru Shi.

Figure 1. Comparison of (a) post-hoc failure attribution and (b) online auditing on the same multi-agent task. (a) Post-hoc failure attribution inspects the trajectory only after it has failed and identifies the decisive error retrospectively, by which point downstream propagation has already locked in the failure. (b) Our AgentForesight instead evaluates each prefix as the trajectory unfolds and flags the decisive …
Figure 2. Overview of AgentForesight. (a) The AFTraj-2K construction pipeline collects trajectories from off-the-shelf multi-agent systems across multiple domains, retains successful runs through a strict filtering pipeline, and produces failure runs via decisive-error injection and multi-judge voting verification. (b) A coarse-to-fine training recipe that first equips the auditor with a risk-anticipation prior on a…
Figure 3. Ablation of the two-stage coarse-to-fine recipe on AFTraj-2K, comparing +Stage 1, +Stage 2, and the full two-stage AgentForesight-7B. (Accompanying scatter plot: False Alarm Rate (%, lower is better) versus Step Accuracy (%, higher is better) for Llama3.2-3B, Gemma3-4B, Qwen2.5-7B-it, Qwen3-8B, GPT-4.1, Gemini-3-Flash, Claude-Haiku-4.5, DeepSeek-V4-Flash, DeepSeek-V4-Pro, and AgentForesight-7B (ours), with a "deployable" region marked.)
Figure 5. Case study of online auditing, comparing predictions from DeepSeek-V4-Pro, Gemini-3-Flash, and AgentForesight-7B. Where strong baselines miss or mis-locate …
Figure 6. (No caption extracted.)
Figure 7. Math case study comparing decisive-error verdicts from Gemini-3-Flash, GPT-4.1, and AgentForesight-7B on a MATH-500 trajectory whose decisive error commits late at Step 6. Late-committing decisive errors …
Original abstract

LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Building on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post-hoc failure detection to deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentForesight, a framework that reframes failure attribution in LLM-based multi-agent systems as an online auditing task: at each step, an auditor sees only the current trajectory prefix and must decide whether to continue or raise an alarm at the earliest decisive error. The authors curate AFTraj-2K (safe trajectories retained via strict pipeline; unsafe trajectories annotated at the decisive error step by LLM-judge consensus) and train AgentForesight-7B via a coarse-to-fine RL recipe—first instilling a risk-anticipation prior on adjacent safe/unsafe prefix pairs, then refining it with a three-axis reward targeting the what/where/who of the verdict. Experiments on AFTraj-2K and the external Who&When benchmark report that the 7B model outperforms GPT-4.1 and DeepSeek-V4-Pro by up to +19.9% while reducing step-localization error by 3×.

Significance. If the central claims hold, the work would meaningfully advance reliable deployment of long-horizon multi-agent systems by shifting from post-hoc diagnosis to prefix-based early intervention. The coarse-to-fine RL recipe and the explicit separation of safe/unsafe prefix pairs constitute a concrete technical contribution; the reported gains over strong proprietary baselines on both internal and external benchmarks are noteworthy and would be of immediate interest to the community if the annotation pipeline can be shown to avoid hindsight leakage.

major comments (2)
  1. [Dataset curation and annotation pipeline] Dataset curation (AFTraj-2K construction): the decisive-error step labels are produced by LLM-judge consensus on complete trajectories. Because judges observe future steps unavailable to the online auditor, the resulting positive/negative boundary may encode information that does not exist at the prefix the model actually sees. This directly affects both stages of the coarse-to-fine RL recipe and is load-bearing for the claim of genuine early prediction rather than post-hoc mimicry.
  2. [Experiments and results] Evaluation protocol: the reported +19.9% gain and 3× localization improvement are presented without a detailed breakdown of how the external Who&When benchmark was adapted or whether its test prefixes were constructed to exclude any future-step information used in annotation. Without this, it is difficult to assess whether the performance edge generalizes to true online deployment.
minor comments (2)
  1. [Training recipe] The three-axis reward weights are listed as free parameters; a brief sensitivity analysis or default values should be provided so readers can reproduce the exact training signal.
  2. [Metrics] Clarify the precise definition of 'step localization error' (e.g., absolute step offset, normalized distance) and whether it is computed only on trajectories where an alarm is raised.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important considerations for ensuring the online nature of the auditing task. We address each major point below with clarifications and proposed revisions. These changes will strengthen the transparency of our dataset construction and evaluation protocol without altering the core technical contributions.

Point-by-point responses
  1. Referee: [Dataset curation and annotation pipeline] Dataset curation (AFTraj-2K construction): the decisive-error step labels are produced by LLM-judge consensus on complete trajectories. Because judges observe future steps unavailable to the online auditor, the resulting positive/negative boundary may encode information that does not exist at the prefix the model actually sees. This directly affects both stages of the coarse-to-fine RL recipe and is load-bearing for the claim of genuine early prediction rather than post-hoc mimicry.

    Authors: We agree this is a substantive concern for any prefix-based prediction task. Our annotation protocol instructs judges to identify the earliest step at which the trajectory becomes irreversibly unsafe based on observable divergence (e.g., incorrect code commit, flawed reasoning chain), with explicit prompts to avoid relying on later recovery attempts. Multiple judges reach consensus only when the failure is locally detectable from the prefix onward. Nevertheless, we acknowledge that full-trajectory context can subtly influence boundary placement. In the revision we will (1) release the exact judge prompts and consensus rules, (2) add an ablation comparing performance when labels are derived from prefix-only human review on a 200-trajectory subset, and (3) include a limitations paragraph quantifying residual leakage risk. These additions provide the requested transparency while preserving the claim that the auditor itself never sees future tokens. revision: partial

  2. Referee: [Experiments and results] Evaluation protocol: the reported +19.9% gain and 3× localization improvement are presented without a detailed breakdown of how the external Who&When benchmark was adapted or whether its test prefixes were constructed to exclude any future-step information used in annotation. Without this, it is difficult to assess whether the performance edge generalizes to true online deployment.

    Authors: We thank the referee for noting this omission. The Who&When test prefixes were created by truncating each trajectory exactly at the original annotation's decisive-error step, discarding all subsequent tokens; no post-failure information is present in any input the model receives. We will add a dedicated appendix subsection that (a) reproduces the truncation procedure, (b) provides pseudocode for prefix construction, and (c) reports a verification step confirming zero future-token leakage. This documentation will make the online evaluation protocol fully reproducible and directly address the generalization question. revision: yes
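
The two procedures proposed in these responses lend themselves to small, checkable sketches: multi-judge consensus labeling and leak-free prefix truncation. The agreement threshold, tie handling, and record layout below are assumptions, not the paper's published settings.

```python
# Hedged sketches of the two procedures described in the responses above.
from collections import Counter

def consensus_label(judge_steps, min_agree=2):
    """judge_steps: decisive-error step proposed by each independent judge."""
    if not judge_steps:
        return None
    step, votes = Counter(judge_steps).most_common(1)[0]
    return step if votes >= min_agree else None  # None => discard trajectory

def make_online_prefix(steps, decisive_step_index):
    """Keep steps up to and including the labeled decisive error; drop the rest."""
    assert 0 <= decisive_step_index < len(steps)
    prefix = steps[: decisive_step_index + 1]
    assert len(prefix) == decisive_step_index + 1  # no post-failure step survives
    return prefix
```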

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper curates AFTraj-2K with LLM-judge annotations on full trajectories for decisive error steps, then trains AgentForesight-7B via a coarse-to-fine RL procedure on prefix observations only, followed by evaluation on AFTraj-2K and an external Who&When benchmark. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction; the training signal and performance metrics remain independent of any tautological redefinition of the target labels.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim relies on the assumption that the curated dataset reflects real failure points and that the RL training generalizes to new trajectories. Since only the abstract is available, the ledger is based on the described components: reliance on LLM-as-judge labeling and the effectiveness of the coarse-to-fine RL.

free parameters (1)
  • three-axis reward weights
    The joint reward targeting what, where, and who of an audit verdict is likely tuned during the RL training process.
axioms (1)
  • domain assumption: Consensus among multiple LLM judges accurately identifies the decisive error step in unsafe trajectories
    This is used to annotate the AFTraj-2K dataset as described in the abstract.

pith-pipeline@v0.9.0 · 5648 in / 1447 out tokens · 59623 ms · 2026-05-15T05:20:31.048676+00:00 · methodology

discussion (0)
