pith. machine review for the scientific record.

arxiv: 2605.08715 · v2 · submitted 2026-05-09 · 💻 cs.CL · cs.AI · cs.MA

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Pith reviewed 2026-05-15 05:20 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA
keywords multi-agent systems · online auditing · failure prediction · LLM agents · reinforcement learning · trajectory analysis · early intervention · error localization

The pith

A 7B model audits unfolding multi-agent trajectories online to flag the earliest decisive error using only the current prefix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based multi-agent systems fail when one early mistake is accepted and cascades into full trajectory collapse. Existing approaches diagnose the responsible step and agent only after the run ends, losing any chance to intervene. This work reframes the problem as online auditing: at every step an auditor sees only the prefix so far and must either let the trajectory continue or raise an alarm at the first decisive error. The authors curate AFTraj-2K, with strictly filtered safe trajectories and step-level annotations on unsafe ones obtained by consensus among LLM judges. They then train AgentForesight-7B with a two-stage reinforcement-learning procedure that first installs a risk-anticipation prior on adjacent safe/unsafe prefixes and then sharpens localization with a three-axis reward covering the what, where, and who of each verdict. On AFTraj-2K and the external Who&When benchmark, the 7B auditor exceeds GPT-4.1 and DeepSeek-V4-Pro by up to 19.9 percent while cutting step-localization error by a factor of three.
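
To make that contract concrete, here is a minimal sketch of the prefix-only auditing loop the pith describes. The names `agents_step_fn` and `audit_prefix` are hypothetical stand-ins for the multi-agent system and the trained auditor; neither signature comes from the paper.

```python
# Minimal sketch of the prefix-only auditing contract described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    alarm: bool           # the "what": is there a decisive error?
    step: Optional[int]   # the "where": earliest decisive error step
    agent: Optional[str]  # the "who": agent responsible at that step

def run_with_auditor(agents_step_fn, audit_prefix, max_steps=50):
    """Advance the trajectory one step at a time, auditing each prefix."""
    prefix = []
    for _ in range(max_steps):
        prefix.append(agents_step_fn(prefix))  # next (agent, action, result)
        verdict = audit_prefix(prefix)         # sees only the steps so far
        if verdict.alarm:
            return prefix, verdict             # hand off to intervention
    return prefix, Verdict(alarm=False, step=None, agent=None)
```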

Core claim

AgentForesight-7B performs online auditing on agentic trajectories by observing only the current prefix and deciding at each step whether to continue or to alarm at the earliest decisive error. It is obtained by first training a risk-anticipation prior on adjacent safe/unsafe prefix pairs and then refining that prior with a three-axis reward that jointly targets the what, where, and who of the audit verdict, yielding up to 19.9 percent higher performance and three-fold lower step-localization error than leading proprietary models on AFTraj-2K and the Who&When benchmark.

What carries the argument

The coarse-to-fine reinforcement learning recipe that first equips the model with a risk-anticipation prior on adjacent safe/unsafe prefix pairs and then sharpens it into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict.
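
The paper's exact reward implementation is not given in this review, but a hedged illustration of how the three axes could compose is below; the terms, the weights `w_what`, `w_where`, `w_who`, and the step tolerance `tol` are assumed free parameters (echoing the ledger further down), not the published training signal.

```python
# Illustrative composition of a three-axis audit reward; all weights and
# the tolerance are assumptions, not the paper's settings.
def audit_reward(pred_alarm, pred_step, pred_agent,
                 gold_alarm, gold_step, gold_agent,
                 w_what=1.0, w_where=1.0, w_who=0.5, tol=2):
    # "what": was the continue/alarm decision itself correct?
    r = w_what * (1.0 if pred_alarm == gold_alarm else -1.0)
    if pred_alarm and gold_alarm:
        # "where": partial credit that decays with distance from the label
        r += w_where * max(0.0, 1.0 - abs(pred_step - gold_step) / (tol + 1))
        # "who": exact match on the responsible agent
        r += w_who * (1.0 if pred_agent == gold_agent else 0.0)
    return r
```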

If this is right

  • Multi-agent deployments can shift from post-hoc failure analysis to live intervention that stops trajectories before they collapse.
  • Auditing becomes possible at every step without access to future actions or outcomes.
  • A compact 7B model can surpass much larger proprietary systems on both detection accuracy and localization precision.
  • The same staged RL approach that first builds a boundary prior and then refines localization may apply to other sequential auditing tasks.
  • Trajectory-level success rates can improve by catching single decisive mistakes before downstream agents accept them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be embedded inside existing agent runtimes to trigger human review or automatic rollback at the moment of alarm.
  • Because the auditor never sees future steps, it may generalize to open-ended real-world tasks where ground-truth failure is revealed only after the fact.
  • Extending the three-axis reward to include downstream impact estimates could further reduce false alarms while preserving early detection.
  • The curation pipeline's reliance on LLM judges suggests that scaling judge diversity or adding human verification loops would be a natural next experiment.

Load-bearing premise

The step-level annotations produced by consensus among multiple LLM judges correctly identify the decisive error without hindsight, and the curation pipeline for safe trajectories produces data that generalizes to real deployments.

What would settle it

Run the trained auditor on a fresh set of held-out trajectories and measure whether it raises an alarm exactly at the human-identified first decisive error step and nowhere earlier, or whether it misses the error or produces false alarms.
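
One way to score such a held-out run is sketched below. Each record is assumed to be a pair of the model's alarm step and the human-labeled decisive step, with `None` meaning no alarm or a safe trajectory; the metric names are illustrative, not the paper's exact definitions.

```python
# Score a held-out set of (predicted_alarm_step, labeled_decisive_step) pairs.
def score(records):
    exact = misses = false_alarms = n_safe = n_unsafe = 0
    offsets = []
    for pred, gold in records:
        if gold is None:                      # safe trajectory
            n_safe += 1
            false_alarms += pred is not None  # any alarm here is spurious
        else:                                 # unsafe trajectory
            n_unsafe += 1
            if pred is None:
                misses += 1                   # decisive error never flagged
            else:
                exact += pred == gold         # alarm at exactly the labeled step
                offsets.append(abs(pred - gold))
    return {
        "step_accuracy": exact / max(n_unsafe, 1),
        "mean_step_offset": sum(offsets) / max(len(offsets), 1),
        "false_alarm_rate": false_alarms / max(n_safe, 1),
        "miss_rate": misses / max(n_unsafe, 1),
    }

print(score([(3, 3), (5, 2), (None, 4), (None, None), (1, None)]))
```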

Figures

Figures reproduced from arXiv: 2605.08715 by Boxuan Zhang, Dongfang Liu, Jianing Zhu, Ruixiang Tang, Zeru Shi.

Figure 1. Comparison of (a) post-hoc failure attribution and (b) online auditing on the same multi-agent task. (a) Post-hoc failure attribution inspects the trajectory only after it has failed and identifies the decisive error retrospectively, by which point downstream propagation has already locked in the failure. (b) Our AgentForesight instead evaluates each prefix as the trajectory unfolds and flags the decisive …
Figure 2. Overview of AgentForesight. (a) The AFTraj-2K construction pipeline collects trajectories from off-the-shelf multi-agent systems across multiple domains, retains successful runs through a strict filtering pipeline, and produces failure runs via decisive-error injection and multi-judge voting verification. (b) A coarse-to-fine training recipe that first equips the auditor with a risk-anticipation prior on a…
Figure 3. Ablation of the two-stage coarse-to-fine recipe on AFTraj-2K, comparing +Stage 1, +Stage 2, and the full two-stage AgentForesight-7B. (Accompanying scatter plot: False Alarm Rate (%, lower is better) versus Step Accuracy (%, higher is better) for Llama3.2-3B, Gemma3-4B, Qwen2.5-7B-it, Qwen3-8B, GPT-4.1, Gemini-3-Flash, Claude-Haiku-4.5, DeepSeek-V4-Flash, DeepSeek-V4-Pro, and AgentForesight-7B (ours), with a "deployable" region marked.)
Figure 5. Case study of online auditing, comparing predictions from DeepSeek-V4-Pro, Gemini-3-Flash, and AgentForesight-7B. Where strong baselines miss or mis-locate …
Figure 6. (No caption extracted.)
Figure 7. Math case study comparing decisive-error verdicts from Gemini-3-Flash, GPT-4.1, and AgentForesight-7B on a MATH-500 trajectory whose decisive error commits late at Step 6. Late-committing decisive errors …
Original abstract

LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Building on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3× lower step localization error, opening the loop from post-hoc failure detection to deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentForesight, a framework that reframes failure attribution in LLM-based multi-agent systems as an online auditing task: at each step, an auditor sees only the current trajectory prefix and must decide whether to continue or raise an alarm at the earliest decisive error. The authors curate AFTraj-2K (safe trajectories retained via strict pipeline; unsafe trajectories annotated at the decisive error step by LLM-judge consensus) and train AgentForesight-7B via a coarse-to-fine RL recipe—first instilling a risk-anticipation prior on adjacent safe/unsafe prefix pairs, then refining it with a three-axis reward targeting the what/where/who of the verdict. Experiments on AFTraj-2K and the external Who&When benchmark report that the 7B model outperforms GPT-4.1 and DeepSeek-V4-Pro by up to +19.9% while reducing step-localization error by 3×.

Significance. If the central claims hold, the work would meaningfully advance reliable deployment of long-horizon multi-agent systems by shifting from post-hoc diagnosis to prefix-based early intervention. The coarse-to-fine RL recipe and the explicit separation of safe/unsafe prefix pairs constitute a concrete technical contribution; the reported gains over strong proprietary baselines on both internal and external benchmarks are noteworthy and would be of immediate interest to the community if the annotation pipeline can be shown to avoid hindsight leakage.

major comments (2)
  1. [Dataset curation and annotation pipeline] Dataset curation (AFTraj-2K construction): the decisive-error step labels are produced by LLM-judge consensus on complete trajectories. Because judges observe future steps unavailable to the online auditor, the resulting positive/negative boundary may encode information that does not exist at the prefix the model actually sees. This directly affects both stages of the coarse-to-fine RL recipe and is load-bearing for the claim of genuine early prediction rather than post-hoc mimicry.
  2. [Experiments and results] Evaluation protocol: the reported +19.9% gain and 3× localization improvement are presented without a detailed breakdown of how the external Who&When benchmark was adapted or whether its test prefixes were constructed to exclude any future-step information used in annotation. Without this, it is difficult to assess whether the performance edge generalizes to true online deployment.
minor comments (2)
  1. [Training recipe] The three-axis reward weights are listed as free parameters; a brief sensitivity analysis or default values should be provided so readers can reproduce the exact training signal.
  2. [Metrics] Clarify the precise definition of 'step localization error' (e.g., absolute step offset, normalized distance) and whether it is computed only on trajectories where an alarm is raised.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important considerations for ensuring the online nature of the auditing task. We address each major point below with clarifications and proposed revisions. These changes will strengthen the transparency of our dataset construction and evaluation protocol without altering the core technical contributions.

Point-by-point responses
  1. Referee: [Dataset curation and annotation pipeline] Dataset curation (AFTraj-2K construction): the decisive-error step labels are produced by LLM-judge consensus on complete trajectories. Because judges observe future steps unavailable to the online auditor, the resulting positive/negative boundary may encode information that does not exist at the prefix the model actually sees. This directly affects both stages of the coarse-to-fine RL recipe and is load-bearing for the claim of genuine early prediction rather than post-hoc mimicry.

    Authors: We agree this is a substantive concern for any prefix-based prediction task. Our annotation protocol instructs judges to identify the earliest step at which the trajectory becomes irreversibly unsafe based on observable divergence (e.g., incorrect code commit, flawed reasoning chain), with explicit prompts to avoid relying on later recovery attempts. Multiple judges reach consensus only when the failure is locally detectable from the prefix onward. Nevertheless, we acknowledge that full-trajectory context can subtly influence boundary placement. In the revision we will (1) release the exact judge prompts and consensus rules, (2) add an ablation comparing performance when labels are derived from prefix-only human review on a 200-trajectory subset, and (3) include a limitations paragraph quantifying residual leakage risk. These additions provide the requested transparency while preserving the claim that the auditor itself never sees future tokens. revision: partial

  2. Referee: [Experiments and results] Evaluation protocol: the reported +19.9% gain and 3× localization improvement are presented without a detailed breakdown of how the external Who&When benchmark was adapted or whether its test prefixes were constructed to exclude any future-step information used in annotation. Without this, it is difficult to assess whether the performance edge generalizes to true online deployment.

    Authors: We thank the referee for noting this omission. The Who&When test prefixes were created by truncating each trajectory exactly at the original annotation's decisive-error step, discarding all subsequent tokens; no post-failure information is present in any input the model receives. We will add a dedicated appendix subsection that (a) reproduces the truncation procedure, (b) provides pseudocode for prefix construction, and (c) reports a verification step confirming zero future-token leakage. This documentation will make the online evaluation protocol fully reproducible and directly address the generalization question. revision: yes
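
The two procedures proposed in these responses lend themselves to small, checkable sketches: multi-judge consensus labeling and leak-free prefix truncation. The agreement threshold, tie handling, and record layout below are assumptions, not the paper's published settings.

```python
# Hedged sketches of the two procedures described in the responses above.
from collections import Counter

def consensus_label(judge_steps, min_agree=2):
    """judge_steps: decisive-error step proposed by each independent judge."""
    if not judge_steps:
        return None
    step, votes = Counter(judge_steps).most_common(1)[0]
    return step if votes >= min_agree else None  # None => discard trajectory

def make_online_prefix(steps, decisive_step_index):
    """Keep steps up to and including the labeled decisive error; drop the rest."""
    assert 0 <= decisive_step_index < len(steps)
    prefix = steps[: decisive_step_index + 1]
    assert len(prefix) == decisive_step_index + 1  # no post-failure step survives
    return prefix
```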

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper curates AFTraj-2K with LLM-judge annotations on full trajectories for decisive error steps, then trains AgentForesight-7B via a coarse-to-fine RL procedure on prefix observations only, followed by evaluation on AFTraj-2K and an external Who&When benchmark. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction; the training signal and performance metrics remain independent of any tautological redefinition of the target labels.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim relies on the assumption that the curated dataset reflects real failure points and that the RL training generalizes to new trajectories. Since only the abstract is available, the ledger is based on the described components: reliance on LLM-as-judge labeling and the effectiveness of the coarse-to-fine RL.

free parameters (1)
  • three-axis reward weights
    The joint reward targeting what, where, and who of an audit verdict is likely tuned during the RL training process.
axioms (1)
  • domain assumption: Consensus among multiple LLM judges accurately identifies the decisive error step in unsafe trajectories
    This is used to annotate the AFTraj-2K dataset as described in the abstract.

pith-pipeline@v0.9.0 · 5648 in / 1447 out tokens · 59623 ms · 2026-05-15T05:20:31.048676+00:00 · methodology

discussion (0)
