SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation

Chenyang Zhu; Daben Liu; Erin Babinsky; Jiayu Yao; Jingyu Wu; Kushal Chawla; Nathan Wolfe; Pengshan Cai; Sambit Sahu; Sangwoo Cho; Shi-Xiong Zhang; Spencer Hong; Youbing Yin

arxiv: 2606.24626 · v1 · pith:RKYJL7KXnew · submitted 2026-06-23 · 💻 cs.AI

Pith reviewed 2026-06-25 23:30 UTC · model grok-4.3

classification 💻 cs.AI

keywords agent fault attribution · long-horizon trajectories · tool-augmented diagnosis · short-term memory · trajectory analysis · autonomous agents · context window limits

The pith

SAFARI uses a tool-augmented diagnostic loop with segment search tools and persistent short-term memory to attribute faults in agent trajectories that exceed LLM context windows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that fault attribution for complex agent executions can continue working even when the full trace no longer fits inside any single context window. It replaces the usual approach of loading the entire trajectory at once with an active loop in which the model calls tools to read or search chosen segments and keeps a running short-term memory across steps. A sympathetic reader would care because real agent tasks already produce traces longer than current models can hold, and without a new method the ability to understand why an agent failed would disappear as task length grows.

Core claim

SAFARI replaces linear context loading with a tool-augmented diagnostic loop. By equipping LLMs with a specialized toolbox to read and search trajectory segments alongside a persistent Short-Term Memory for cross-turn reasoning, the framework decouples diagnostic accuracy from architectural context limits. Experiments show it outperforms prior results by 20 percent on the Who&When dataset inside a 1M token budget and by 19 percent on the TRAIL GAIA subset inside a 25K token budget, while still reaching 0.58 precision when the target fault lies 5 times beyond the model's native context window.

What carries the argument

The tool-augmented diagnostic loop that uses a toolbox for reading and searching trajectory segments together with persistent Short-Term Memory to investigate without full context loading.
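The loop described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the names `Trajectory`, `read_segment`, `search`, `ShortTermMemory` and `investigate` are hypothetical, the trace is modeled as a list of step strings, and a fixed keyword list stands in for the LLM that would choose tool calls in the real system. The paper's actual toolbox and memory update rule are not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trajectory:
    """An agent trace stored as a list of step strings (assumed format)."""
    steps: list[str]

    def read_segment(self, start: int, end: int) -> list[tuple[int, str]]:
        """Hypothetical read tool: return (index, text) for steps in [start, end)."""
        start = max(0, start)
        return list(enumerate(self.steps[start:end], start=start))

    def search(self, keyword: str, limit: int = 5) -> list[int]:
        """Hypothetical search tool: indices of up to `limit` matching steps."""
        needle = keyword.lower()
        hits = [i for i, s in enumerate(self.steps) if needle in s.lower()]
        return hits[:limit]


@dataclass
class ShortTermMemory:
    """Findings that persist across investigation turns."""
    findings: dict[int, str] = field(default_factory=dict)

    def note(self, step: int, evidence: str) -> None:
        self.findings[step] = evidence

    def earliest(self) -> Optional[int]:
        return min(self.findings) if self.findings else None


def investigate(traj: Trajectory, keywords: list[str], context: int = 1) -> Optional[int]:
    """Run one turn per keyword; the final answer is derived from memory only."""
    stm = ShortTermMemory()
    for keyword in keywords:
        for hit in traj.search(keyword):
            # Read a small window around the hit instead of the whole trace.
            for idx, text in traj.read_segment(hit - context, hit + context + 1):
                if keyword.lower() in text.lower():
                    stm.note(idx, text)
    # Assumed attribution rule: blame the earliest suspicious step in memory.
    return stm.earliest()
```

In this toy version the final attribution never touches the raw trace again; it depends only on what was written to memory, which is the property the framework relies on.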

If this is right

Fault diagnosis accuracy becomes independent of whether the entire execution trace fits inside the model's context window.
Performance on the Who&When benchmark rises by 20 percent inside a 1M token budget relative to earlier evaluators.
Precision holds at 0.58 on faults located 5 times past the native context limit, a regime where full-trajectory methods return no usable result.
On the TRAIL GAIA subset the same loop improves results by 19 percent inside a 25K token budget.
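To make the "beyond the native context window" regime concrete, the toy below contrasts a loader that keeps only the first `window` tokens with a reader that pages through the trace one window at a time. This is a deliberately simpler stand-in, not SAFARI's method: sequential paging replaces the search-driven loop, tokens are plain list items, and `truncated_view`, `paged_views` and `locate` are hypothetical helper names.

```python
from typing import Iterator, Optional


def truncated_view(tokens: list[str], window: int) -> list[str]:
    """Full-trajectory loading under a hard limit: only the first `window` tokens fit."""
    return tokens[:window]


def paged_views(tokens: list[str], window: int) -> Iterator[tuple[int, list[str]]]:
    """Segment reading: yield (offset, segment) pairs, each within the window."""
    for start in range(0, len(tokens), window):
        yield start, tokens[start:start + window]


def locate(tokens: list[str], marker: str, window: int) -> Optional[int]:
    """Return the absolute position of `marker`, reading one segment at a time."""
    for start, segment in paged_views(tokens, window):
        if marker in segment:
            return start + segment.index(marker)
    return None


# A 600-token trace with the fault past five full windows of 100 tokens.
tokens = ["ok"] * 600
tokens[520] = "FAULT"
print("FAULT" in truncated_view(tokens, 100))  # the truncated view never sees it
print(locate(tokens, "FAULT", 100))            # paged reading reaches it
```

The sketch says nothing about the reported 0.58 precision; it only shows why a method limited to one window cannot return an answer for such a fault while a segment reader at least has access to it.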

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern of targeted segment access plus running memory could be tried on other long-sequence diagnostic problems such as program execution traces or multi-step scientific experiments.
Systems that avoid loading full histories might lower the token cost of repeated agent evaluations even when context windows continue to grow.
One could test whether the toolbox needs to be enlarged when the underlying agent itself uses multiple interacting models rather than a single one.

Load-bearing premise

A specialized toolbox for reading and searching trajectory segments plus a persistent short-term memory can preserve all information necessary for accurate fault attribution without the full trajectory in context.

What would settle it

A concrete test trajectory in which the decisive fault evidence sits in a segment the search tools never retrieve and the memory never retains, causing SAFARI to output an incorrect attribution while any method that loads the full trace succeeds.
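A check of this kind could be sketched as follows. The setup is hypothetical throughout: `keyword_investigator` is a crude stand-in for a retrieval-limited diagnoser, `full_trace_oracle` stands in for a method that sees every step, and the ground-truth predicate `is_fault` is supplied by whoever builds the test case.

```python
from typing import Callable, Optional


def keyword_investigator(steps: list[str], keywords: list[str]) -> Optional[int]:
    """Retrieval-limited diagnosis: only steps matching a keyword are ever seen."""
    hits = [i for i, s in enumerate(steps) if any(k in s.lower() for k in keywords)]
    return min(hits) if hits else None


def full_trace_oracle(steps: list[str], is_fault: Callable[[str], bool]) -> Optional[int]:
    """Reference diagnosis: inspects every step with a ground-truth predicate."""
    for i, step in enumerate(steps):
        if is_fault(step):
            return i
    return None


def is_counterexample(
    steps: list[str],
    keywords: list[str],
    is_fault: Callable[[str], bool],
    true_fault: int,
) -> bool:
    """True when the full-trace method is right and the retrieval-limited one is wrong."""
    return (
        full_trace_oracle(steps, is_fault) == true_fault
        and keyword_investigator(steps, keywords) != true_fault
    )


# The decisive step is a silent arithmetic slip that contains no error keyword.
steps = [
    "plan: sum 40 + 2",
    "calc: total = 41",
    "report: total is 41",
    "error: answer rejected",
]
silent_slip = lambda s: s.startswith("calc") and "42" not in s
print(is_counterexample(steps, ["error"], silent_slip, true_fault=1))
```

Here the keyword search only ever surfaces the final rejection, so the retrieval-limited diagnoser blames the symptom rather than the cause. A real test would swap in SAFARI and an actual full-context evaluator.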

Figures

Figures reproduced from arXiv: 2606.24626 by Chenyang Zhu, Daben Liu, Erin Babinsky, Jiayu Yao, Jingyu Wu, Kushal Chawla, Nathan Wolfe, Pengshan Cai, Sambit Sahu, Sangwoo Cho, Shi-Xiong Zhang, Spencer Hong, Youbing Yin.

**Figure 1.** Illustration of SAFARI’s Active Investigation loop. … Hand-crafted subset), derived from GAIA (Mialon et al., 2023) and AssistantBench (Yoran et al., 2024). Compared to the Algorithm-Generated subset, the Hand-crafted subset features significantly longer trajectory complexity with more agentic steps and tokens. TRAIL (Deshpande et al., 2025) provides long-horizon agentic trajectories, with some traces excee…

**Figure 2.** Performance comparison on the TRAIL and Who&When datasets. SAFARI outperforms the baseline on both datasets across various fixed token budgets and metrics. Full benchmark results can be found in Appendix A.

Original abstract

As autonomous agents tackle increasingly complex multi-step, multi-agent tasks, their execution trajectories have scaled beyond the constraints of even the largest context windows. Current methods for effectively diagnosing agent failures load the full trajectory into an LLM's context window, which suffers from attention dilution and fails when agentic traces inevitably exceed context limits. To address this, we introduce SAFARI (Scaling long-horizon Agentic Fault AttRibution via active Investigation), a framework that replaces linear context loading with a tool-augmented diagnostic loop. By equipping LLMs with a specialized toolbox to read and search trajectory segments alongside a persistent Short-Term Memory (STM) for cross-turn reasoning, SAFARI effectively decouples diagnostic accuracy from architectural context limits. Our experiments demonstrate that SAFARI outperforms state-of-the-art results by 20% on the Who&When dataset within a 1M token budget, and by 19% on TRAIL GAIA subset on a 25K token budget. Most significantly, SAFARI maintains a 0.58 precision even when the target fault resides 5x beyond the model's native context window, a scenario where traditional evaluators fail entirely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SAFARI's toolbox-plus-STM loop targets a real scaling limit in agent fault diagnosis, but the abstract supplies too few method details to judge whether the 0.58 precision claim holds.

The main takeaway is that SAFARI replaces full-trajectory context loading with an LLM loop that uses a specialized read/search toolbox on trajectory segments plus a persistent short-term memory. It reports a 20% gain on Who&When inside a 1M token budget, a 19% gain on the TRAIL GAIA subset inside a 25K token budget, and 0.58 precision when the fault sits 5x beyond the native context window, a setting where the abstract says traditional evaluators fail entirely.

The paper correctly identifies attention dilution as a practical blocker for long multi-agent traces and offers a concrete alternative that keeps token use low. That framing is useful for anyone building or evaluating agents on extended tasks.

The soft spots sit in the missing implementation. The abstract names the toolbox and STM but gives no search algorithm, update rule for the memory, or guarantee that relevant segments will be retrieved when the fault is distant. The stress-test concern about possible missed evidence therefore stands on the current text. The performance deltas are also stated without baseline definitions, dataset sizes, or error breakdown, so it is difficult to know how much the numbers depend on the new components versus other factors.

This is for researchers working on agent reliability and long-horizon evaluation. A reader already thinking about context limits would find the direction worth following, but would need the full methods and results sections to decide whether the approach is robust.

It deserves peer review. The problem is concrete and the proposed fix is straightforward enough that referees can check the missing pieces directly.

Referee Report

2 major / 0 minor

Summary. The paper introduces SAFARI, a framework that replaces full-trajectory context loading with a tool-augmented diagnostic loop consisting of a specialized toolbox for reading and searching trajectory segments plus a persistent Short-Term Memory (STM). It claims this decouples diagnostic accuracy from LLM context limits, yielding 20% improvement over SOTA on the Who&When dataset (1M token budget), 19% on the TRAIL GAIA subset (25K token budget), and 0.58 precision when the target fault lies 5x beyond the native context window (where traditional evaluators fail entirely).

Significance. If the empirical results hold under rigorous verification, the work addresses a practical scalability barrier in diagnosing failures of complex multi-agent systems whose traces exceed current context windows, offering a concrete alternative to attention-dilution problems.

major comments (2)

[Abstract] The central performance claims (20% and 19% deltas, 0.58 precision at 5x context distance) are presented without any experimental protocol, baseline definitions, dataset statistics, or error analysis, so the comparisons cannot be assessed.
[Abstract] The claim that toolbox-guided search plus STM recovers all fault-relevant segments and cross-segment dependencies without information loss is load-bearing for the 5x-context result, yet the text supplies neither the search algorithm, the STM update rule, nor any completeness argument for arbitrary long-horizon traces.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting issues with the abstract's presentation of results and framework details. We address each comment below and will revise the abstract accordingly while preserving the manuscript's core contributions.

Referee: [Abstract] The central performance claims (20% and 19% deltas, 0.58 precision at 5x context distance) are presented without any experimental protocol, baseline definitions, dataset statistics, or error analysis, so the comparisons cannot be assessed.

Authors: We agree the abstract is too terse for full assessment of the claims. The full experimental protocol, baselines (SOTA methods), dataset statistics (Who&When at 1M tokens, TRAIL GAIA at 25K tokens), and error analysis appear in Section 4. In revision we will expand the abstract with a concise clause such as 'evaluated on Who&When (1M token budget) and TRAIL GAIA (25K token budget) against SOTA baselines' while keeping length constraints in mind.

revision: yes

Referee: [Abstract] The claim that toolbox-guided search plus STM recovers all fault-relevant segments and cross-segment dependencies without information loss is load-bearing for the 5x-context result, yet the text supplies neither the search algorithm, the STM update rule, nor any completeness argument for arbitrary long-horizon traces.

Authors: The abstract summarizes at a high level; the search algorithm, STM update rule, and supporting arguments for segment recovery are detailed in Section 3 (Methodology), including the toolbox design and persistent memory mechanics. We will revise the abstract to add a brief clause referencing the active investigation loop and Section 3, and to qualify the 'without information loss' phrasing as empirically supported rather than a universal guarantee.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results on external datasets

The paper introduces the SAFARI framework and reports direct empirical performance numbers (20% and 19% outperformance, 0.58 precision at 5x context distance) on named external datasets (Who&When, TRAIL GAIA subset). No equations, fitted parameters, self-definitional quantities, or load-bearing self-citations appear in the provided claims. The results are presented as measured outcomes rather than quantities derived by construction from the method's own inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, mathematical axioms, or independently evidenced invented entities; the short-term memory and toolbox are presented as engineering components of the framework.

pith-pipeline@v0.9.1-grok · 5771 in / 1111 out tokens · 28709 ms · 2026-06-25T23:30:06.568491+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 3 linked inside Pith

[1] Raffles: Reasoning-based attribution of faults for LLM systems. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).

[2] Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems.

[3] Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370.

[4] TRAIL: Trace reasoning and agentic issue localization. arXiv preprint arXiv:2505.08638.

[5] Magentic-One: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468.

[6] AssistantBench: Can web agents solve realistic and time-consuming tasks? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

[7] GAIA: A benchmark for general AI assistants. The Twelfth International Conference on Learning Representations.

[8] AgentFold: Long-Horizon Web Agents with Proactive Context Management. arXiv preprint arXiv:2510.24699.

[9] Beyond a million tokens: Benchmarking and enhancing long-term memory in LLMs. arXiv preprint arXiv:2510.27246.

[10] Adaptive in-conversation team building for language model agents. arXiv preprint arXiv:2405.19425.

[11] Interactive debugging and steering of multi-agent AI systems. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025.

[12] CORRECT: COndensed eRror RECognition via knowledge Transfer in multi-agent systems. arXiv preprint arXiv:2509.24088.

[13] Cemri, Mert; Pan, Melissa Z.; Yang, Shuyi; Agrawal, Lakshya A.; Chopra, Bhavya; Tiwari, Rishabh; Keutzer, Kurt; Parameswaran, Aditya; Klein, Dan; Ramchandran, Kannan; Zaharia, Matei; Gonzalez, Joseph E.; Stoica, Ion. arXiv preprint arXiv:2503.13657.

[14] Ge, Yu; Xie, Linna; Li, Zhong; Pei, Yu; Zhang, Tian. arXiv preprint arXiv:2509.13782.
