FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

Dong Jae Kim; Md Ahasanuzzaman; Md Nakhla Rafi; Tse-Hsun Chen; Zhijie Wang

arxiv: 2606.00765 · v1 · pith:WHDM5LSYnew · submitted 2026-05-30 · 💻 cs.AI

Pith reviewed 2026-06-28 18:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · failure attribution · multi-agent trajectories · dependency tracing · responsible agent identification · decisive step detection · trajectory diagnosis

The pith

FALAT frames failure attribution in LLM agent trajectories as dependency-guided search that first builds an expected solution path then isolates the decisive error-introducing step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that failures in long LLM agent trajectories cannot be diagnosed by treating each step as an independent classification problem, because errors propagate through dependent decisions, tool outputs, and messages. FALAT instead constructs an expectation of how the task should be solved, uses it to locate suspicious regions, traces dependency links to separate steps that introduce errors from those that only inherit them, and finally tests whether correcting a candidate step would recover the expected outcome. This process identifies both the responsible agent and the decisive failure step. On the Who&When benchmark, the best configurations reach 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on hand-crafted ones, exceeding direct prompting and prior attribution baselines.

Core claim

FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step.
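
The tracing step described above can be illustrated with a small sketch. It is not the authors' implementation: the `Step` record, the `suspicious` flag, and the `depends_on` edges are hypothetical names introduced here, and the rule used (a flagged step counts as an error source only if none of its transitive dependencies is also flagged) is a simplifying assumption.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int
    agent: str
    suspicious: bool  # assumed to come from comparing the step with the expected solution
    depends_on: list[int] = field(default_factory=list)


def ancestors(steps: dict[int, Step], idx: int) -> set[int]:
    """Return every step index that step `idx` transitively depends on."""
    seen: set[int] = set()
    stack = list(steps[idx].depends_on)
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(steps[cur].depends_on)
    return seen


def split_sources_and_carriers(steps: dict[int, Step]) -> tuple[list[int], list[int]]:
    """Separate flagged steps with clean ancestry (sources) from those downstream of a flagged step (carriers)."""
    sources, carriers = [], []
    for step in steps.values():
        if not step.suspicious:
            continue
        if any(steps[a].suspicious for a in ancestors(steps, step.index)):
            carriers.append(step.index)
        else:
            sources.append(step.index)
    return sorted(sources), sorted(carriers)


# Hypothetical five-step trajectory: step 1 goes wrong, steps 2 and 4 build on it.
trajectory = {
    0: Step(0, "planner", False),
    1: Step(1, "searcher", True, [0]),
    2: Step(2, "coder", True, [1]),
    3: Step(3, "verifier", False, [0]),
    4: Step(4, "coder", True, [2, 3]),
}
sources, carriers = split_sources_and_carriers(trajectory)
```

On this toy trajectory, `sources` is `[1]` and `carriers` is `[2, 4]`: three steps look wrong, but only one of them has no flagged step upstream.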

What carries the argument

Dependency-guided search that constructs an expected solution path, locates suspicious regions, traces decision dependencies, and tests outcome recovery after hypothetical correction.

If this is right

Responsible-agent and decisive-step attribution both improve over baselines that ignore dependencies.
The same search procedure works on both algorithm-generated and hand-crafted multi-agent failure trajectories.
Direct prompting of standalone LLMs is outperformed once dependency tracing and recovery testing are added.
Dependency-aware reasoning is required for reliable diagnosis rather than independent step classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be applied to logged trajectories from deployed agent systems without requiring new benchmarks.
Identifying common decisive steps across many runs might reveal recurring failure patterns that could be mitigated at design time.
The expectation-construction step might itself become a point of failure when tasks lack a single canonical solution path.
Combining the search with execution replay could allow automated repair suggestions beyond mere attribution.

Load-bearing premise

An accurate expectation of the correct task solution can be constructed and then used to reliably flag suspicious regions and to judge whether fixing a step restores the expected outcome.
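
To make the premise concrete, here is a minimal sketch of a recovery test on a toy pipeline. Everything in it is an assumption made for illustration: the step names, the `expected` dictionary standing in for the constructed expectation, and the rule that a decisive step must both recover the outcome when corrected and have had correct inputs itself. The paper's actual procedure may differ.

```python
def run(steps, overrides=None):
    """Evaluate steps in insertion order; `overrides` pins a step's output to a corrected value."""
    overrides = overrides or {}
    values = {}
    for name, (fn, deps) in steps.items():
        if name in overrides:
            values[name] = overrides[name]
        else:
            values[name] = fn(*(values[d] for d in deps))
    return values


# Hypothetical trajectory: the price lookup returns 12 where 10 was expected.
steps = {
    "lookup": (lambda: 12, []),
    "quantity": (lambda: 3, []),
    "total": (lambda price, qty: price * qty, ["lookup", "quantity"]),
    "report": (lambda total: f"total={total}", ["total"]),
}
# Stand-in for the constructed expectation of a correct run.
expected = {"lookup": 10, "quantity": 3, "total": 30, "report": "total=30"}
actual = run(steps)


def recovers(name, final="report"):
    """Would pinning `name` to its expected output restore the expected final outcome?"""
    return run(steps, {name: expected[name]})[final] == expected[final]


def introduces_error(name):
    """A step introduces an error if its inputs were as expected but its output was not."""
    _, deps = steps[name]
    inputs_ok = all(actual[d] == expected[d] for d in deps)
    return inputs_ok and actual[name] != expected[name]


recovering = [name for name in steps if recovers(name)]
decisive = [name for name in recovering if introduces_error(name)]
```

Here `recovering` is `["lookup", "total", "report"]`, since pinning any of those steps restores the final report, while `decisive` is `["lookup"]`. The toy case shows why a recovery test alone is ambiguous and why it leans on the same expectation used to flag steps: if `expected` were wrong, both lists would be wrong with it.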

What would settle it

Run FALAT on a controlled set of trajectories for which the constructed expectation is deliberately made inaccurate. If attribution accuracy remains high, the expectation-driven mechanism is not doing the work the paper claims for it.
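
The shape of such a control can be sketched as follows. This is a hypothetical harness, not an experiment from the paper: `first_mismatch` is a deliberately naive stand-in attributor (it is not FALAT), the trajectories are synthetic, and `corrupt_rate` is an invented knob that replaces one expected step output with a wrong value.

```python
import random


def first_mismatch(actual, expected):
    """Toy attributor: blame the first step whose output differs from the expectation."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    return None


def make_case(length, gold):
    """Steps before `gold` are correct; the error at `gold` propagates to the end."""
    expected = [f"ok{i}" for i in range(length)]
    actual = [e if i < gold else f"bad{i}" for i, e in enumerate(expected)]
    return actual, expected, gold


def accuracy(cases, corrupt_rate, seed=0):
    """Step-level accuracy when the expectation is corrupted with probability `corrupt_rate`."""
    rng = random.Random(seed)
    hits = 0
    for actual, expected, gold in cases:
        noisy = list(expected)
        if rng.random() < corrupt_rate:
            noisy[rng.randrange(len(noisy))] = "WRONG"
        hits += first_mismatch(actual, noisy) == gold
    return hits / len(cases)


cases = [make_case(length=5, gold=4) for _ in range(200)]
clean = accuracy(cases, corrupt_rate=0.0)
noisy = accuracy(cases, corrupt_rate=1.0)
```

Because the toy attributor depends entirely on the expectation, `clean` is 1.0 and `noisy` falls well below it. The informative outcome for a real system would be the opposite pattern: accuracy that barely moves under corruption would suggest the expectation is not what drives the attribution.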

Figures

Figures reproduced from arXiv: 2606.00765 by Dong Jae Kim, Md Ahasanuzzaman, Md Nakhla Rafi, Tse-Hsun Chen, Zhijie Wang.

**Figure 1.** Overview of FALAT. Stage 1 constructs an external prior π, a three-level trajectory representation M, and an initial candidate set C. Stage 2 constructs typed dependencies to separate possible error sources from downstream carriers and prune candidates. Stage 3 performs dependency-guided search and verifies whether fixing a candidate would recover the expected output. Stage 4 locally verifies the predict…

Original abstract

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FALAT frames failure attribution in LLM agents as dependency-guided search rather than independent classification, with reported gains on Who&When, but the expectation step lacks reported validation.

The core idea is to build an expected solution path first, then trace dependencies to separate error-introducing steps from propagated ones, and finally test if fixing a step recovers the expectation. This is a clear shift from treating each step in isolation, and the abstract shows it beating both specialized baselines and plain LLM prompting on the benchmark.

What stands out is the pipeline itself: expectation construction to flag suspicious regions, dependency tracing across agents and tools, and the sufficiency check. The reported numbers are 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on hand-crafted ones. That second figure is low, but the gap over direct prompting suggests the dependency layer adds something.

The soft spot is exactly the one the stress-test flags. The expectation is load-bearing for both identifying regions and running the recovery test, yet the abstract gives no accuracy metric, prompt details, or validation on the benchmark tasks themselves. On hand-crafted trajectories especially, any bias in that expectation would make the downstream attribution numbers hard to trust. No error bars or significance tests are mentioned either.

This is for groups already running multi-agent LLM systems and hitting debugging pain on long traces. A reader who wants a concrete diagnostic layer rather than another prompting trick could get value from the framing. The work shows clear thinking about propagation and is not circular, so it deserves a serious referee even with the gaps in the current write-up. I'd send it out for review with a request to document the expectation step and add controls for its accuracy.

Referee Report

2 major / 2 minor

Summary. The paper proposes FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. It frames attribution as a dependency-guided search: first constructing an expectation of correct task solution to identify suspicious regions, then tracing dependencies among decisions, tool outputs, and messages to distinguish error-introducing steps from propagations, and finally testing whether correcting a candidate step recovers the expected outcome. Evaluated on the Who&When benchmark (algorithm-generated and hand-crafted multi-agent failure trajectories), best configurations achieve 46.0% step-level accuracy on generated trajectories and 29.1% on hand-crafted ones, outperforming specialized attribution baselines and direct LLM prompting.

Significance. If the central results hold after addressing the load-bearing assumption, FALAT would represent a meaningful advance in diagnosing failures in multi-agent LLM systems by moving beyond independent step classification to dependency-aware reasoning. The dual evaluation on algorithm-generated and hand-crafted trajectories is a positive design choice that strengthens the claim that dependency tracing is essential. The work also highlights the distinction between responsible agents and decisive steps, which is a useful conceptual contribution.

major comments (2)

[Abstract, paragraph 2] The method relies on first constructing an expectation of how the task should be solved to flag suspicious regions and to test recovery upon correction. No mechanism, prompt template, accuracy metric, or validation of this expectation's fidelity on the Who&When benchmark tasks is supplied. This assumption is load-bearing for the reported 29.1% accuracy on hand-crafted trajectories, because systematic bias or incompleteness in the expectation would directly corrupt both suspicious-region identification and the recovery test, rendering the attribution improvements uninterpretable.
[Evaluation, results paragraph] The abstract reports 46.0% and 29.1% step-level accuracies and claims outperformance, yet provides no error bars, no description of how the expectation is operationalized, no details on the dependency model, and no information on statistical significance testing or exact baseline implementations. These omissions prevent assessment of whether the numeric gains are robust or merely artifacts of the unvalidated expectation step.

minor comments (2)

The manuscript should clarify the exact definition of 'responsible-agent' and 'decisive-step' attribution metrics and how they are computed from the dependency trace.
Add a dedicated subsection describing the Who&When benchmark construction, trajectory lengths, and failure types to allow reproducibility.
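
One plausible reading of the two metrics, offered only as a sketch: agent-level accuracy counts a prediction as correct when the blamed agent matches the annotation, and step-level accuracy counts it as correct when the blamed step index matches exactly. The function name, the `(agent, step_index)` pair format, and the exact-match rule are assumptions made here; the manuscript's definitions may differ.

```python
def attribution_accuracy(predictions, gold):
    """Each item is an (agent, step_index) pair; returns (agent_accuracy, step_accuracy)."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and of equal length")
    n = len(gold)
    agent_hits = sum(p[0] == g[0] for p, g in zip(predictions, gold))
    step_hits = sum(p[1] == g[1] for p, g in zip(predictions, gold))
    return agent_hits / n, step_hits / n


# Hypothetical annotations and predictions for four failed trajectories.
gold = [("coder", 4), ("searcher", 1), ("planner", 0), ("coder", 7)]
pred = [("coder", 4), ("searcher", 2), ("coder", 3), ("coder", 7)]
agent_acc, step_acc = attribution_accuracy(pred, gold)
```

On this toy data `agent_acc` is 0.75 and `step_acc` is 0.5. The second trajectory shows why the two can diverge: the right agent is blamed at the wrong step.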

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will incorporate the requested clarifications and additional details into a revised manuscript.

Referee: [Abstract, paragraph 2] The method relies on first constructing an expectation of how the task should be solved to flag suspicious regions and to test recovery upon correction. No mechanism, prompt template, accuracy metric, or validation of this expectation's fidelity on the Who&When benchmark tasks is supplied. This assumption is load-bearing for the reported 29.1% accuracy on hand-crafted trajectories, because systematic bias or incompleteness in the expectation would directly corrupt both suspicious-region identification and the recovery test, rendering the attribution improvements uninterpretable.

Authors: We agree that the manuscript currently describes expectation construction only at a high level and does not supply the requested implementation details or validation. In the revision we will add a new subsection under Methods that specifies: (1) the exact mechanism (LLM-based generation of a reference solution trajectory), (2) the full prompt templates used, (3) the accuracy metric applied to measure fidelity against ground-truth solutions on Who&When tasks, and (4) quantitative validation results on both algorithm-generated and hand-crafted subsets. These additions will directly address the load-bearing concern and allow readers to assess potential bias. (Revision: yes)

Referee: [Evaluation, results paragraph] The abstract reports 46.0% and 29.1% step-level accuracies and claims outperformance, yet provides no error bars, no description of how the expectation is operationalized, no details on the dependency model, and no information on statistical significance testing or exact baseline implementations. These omissions prevent assessment of whether the numeric gains are robust or merely artifacts of the unvalidated expectation step.

Authors: We concur that the current presentation lacks these elements. The revised manuscript will add: error bars (standard deviation across five independent runs with different seeds), an explicit operationalization of the expectation step (cross-referenced to the new Methods subsection), a precise description of the dependency model (including how dependency edges are extracted and represented), statistical significance tests (paired t-tests with p-values against each baseline), and exact baseline re-implementation details (model versions, prompting strategies, and hyper-parameters). These changes will enable evaluation of robustness. (Revision: yes)
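
The statistics promised in this response are simple to compute. The sketch below uses only the Python standard library and shows the sample standard deviation across runs and the paired t statistic. The per-seed accuracies are placeholder values invented for illustration, not results from the paper, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_std(scores):
    """Mean and sample standard deviation (n - 1 denominator) across runs."""
    return mean(scores), stdev(scores)


def paired_t_statistic(a, b):
    """Paired t statistic for two equal-length lists of per-seed scores."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least two runs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Placeholder accuracies for five seeds (NOT numbers from the paper).
method_runs = [0.40, 0.42, 0.41, 0.43, 0.39]
baseline_runs = [0.30, 0.33, 0.31, 0.32, 0.29]

method_mean, method_std = mean_and_std(method_runs)
t_stat = paired_t_statistic(method_runs, baseline_runs)
```

With five runs the statistic has four degrees of freedom. Turning it into a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both the statistic and the p-value directly.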

Circularity Check

0 steps flagged

No significant circularity; algorithmic procedure is self-contained

The paper describes FALAT as a sequence of algorithmic steps: constructing an expectation of correct task solution, identifying suspicious regions, tracing dependencies, and testing recovery of the expected outcome. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the provided text. The expectation is treated as an external input constructed prior to dependency tracing rather than defined in terms of the attribution output. No step reduces by construction to its own inputs, satisfying the default expectation that most papers lack circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described in the provided text.

Reference graph

Works this paper leans on

66 extracted references · 35 canonical work pages · 17 internal anchors

[1]

Proceedings of the 42nd International Conference on Machine Learning , year =

Zhang, Shaokun and Yin, Ming and Zhang, Jieyu and Liu, Jiale and Han, Zhiguang and Zhang, Jingyang and Li, Beibin and Wang, Chi and Wang, Huazheng and Chen, Yiran and Wu, Qingyun , title =. Proceedings of the 42nd International Conference on Machine Learning , year =
[2]

IEEE Transactions on software engineering , volume=

How effective developers investigate source code: An exploratory study , author=. IEEE Transactions on software engineering , volume=
[3]

Proceedings of the 30th international conference on Software engineering , pages=

Debugging reinvented: asking and answering why and why not questions about program behavior , author=. Proceedings of the 30th international conference on Software engineering , pages=
[4]

Transactions of the association for computational linguistics , volume=

Lost in the middle: How language models use long contexts , author=. Transactions of the association for computational linguistics , volume=
[5]

arXiv preprint arXiv:2509.25370 , year=

Where llm agents fail and how they can learn from failures , author=. arXiv preprint arXiv:2509.25370 , year=

work page arXiv
[6]

arXiv preprint arXiv:2505.00212 , year=

Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems , author=. arXiv preprint arXiv:2505.00212 , year=

work page arXiv
[7]

ArXiv, abs/2602.23701

From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems , author=. arXiv preprint arXiv:2602.23701 , year=

work page arXiv
[8]

arXiv preprint arXiv:2506.18824 , year=

Understanding software engineering agents: A study of thought-action-result trajectories , author=. arXiv preprint arXiv:2506.18824 , year=

work page arXiv
[9]

ReAct: Synergizing Reasoning and Acting in Language Models

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Proceedings of the 48th international conference on Software engineering , year=

Order Matters! An Empirical Study on Large Language Models' Input Order Bias in Software Fault Localization , author=. Proceedings of the 48th international conference on Software engineering , year=
[11]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo

Who is introducing the failure? automatically attributing failures of multi-agent systems via spectrum analysis , author=. arXiv preprint arXiv:2509.13782 , year=

work page arXiv
[12]

arXiv preprint arXiv:2510.04550 , year=

TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use , author=. arXiv preprint arXiv:2510.04550 , year=

work page arXiv
[13]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

2025
[14]

Diagnosing with Insights: Structured Analysis of Agent Failures via Behavioral Abstractions , author=
[15]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Trial and error: Exploration-based trajectory optimization of LLM agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[16]

arXiv preprint arXiv:2505.13652 , year=

Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents , author=. arXiv preprint arXiv:2505.13652 , year=

work page arXiv
[17]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Metareflection: Learning instructions for language agents using past reflections , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[18]

Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pages=

From knowledge to noise: CTIM-rover and the pitfalls of episodic memory in software engineering agents , author=. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) , pages=

2025
[19]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Utboost: Rigorous evaluation of coding agents on swe-bench , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[21]

Agent Workflow Memory

Agent workflow memory , author=. arXiv preprint arXiv:2409.07429 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

arXiv preprint arXiv:2511.05931 , year=

Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement , author=. arXiv preprint arXiv:2511.05931 , year=

work page arXiv
[23]

Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? , author=. arXiv preprint arXiv:2509.03312 , year=

work page arXiv
[24]

arXiv preprint arXiv:2510.04886 , year=

Where did it all go wrong? A hierarchical look into multi-agent error attribution , author=. arXiv preprint arXiv:2510.04886 , year=

work page arXiv
[25]

arXiv preprint , year=

When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution , author=. arXiv preprint , year=
[26]

arXiv preprint arXiv:2510.10581 , year=

GraphTracer: Graph-Guided Failure Tracing in LLM Agents for Robust Multi-Turn Deep Search , author=. arXiv preprint arXiv:2510.10581 , year=

work page arXiv
[27]

Let's Verify Step by Step

Let's Verify Step by Step , author=. arXiv preprint arXiv:2305.20050 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments , author=. arXiv preprint arXiv:2604.25135 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

Towards Self-Improving Error Diagnosis in Multi-Agent Systems , author=. arXiv preprint arXiv:2604.17658 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

arXiv preprint arXiv:2603.17187 , year=

MetaClaw: Just Talk--An Agent That Meta-Learns and Evolves in the Wild , author=. arXiv preprint arXiv:2603.17187 , year=

work page arXiv
[31]

Advances in neural information processing systems , volume=

Camel: Communicative agents for" mind" exploration of large language model society , author=. Advances in neural information processing systems , volume=
[32]

First conference on language modeling , year=

Autogen: Enabling next-gen LLM applications via multi-agent conversations , author=. First conference on language modeling , year=
[33]

The twelfth international conference on learning representations , year=

MetaGPT: Meta programming for a multi-agent collaborative framework , author=. The twelfth international conference on learning representations , year=
[34]

Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

Chatdev: Communicative agents for software development , author=. Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=
[35]

Forty-first International Conference on Machine Learning , year=

Gptswarm: Language agents as optimizable graphs , author=. Forty-first International Conference on Machine Learning , year=
[36]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Dspy: Compiling declarative language model calls into self-improving pipelines , author=. arXiv preprint arXiv:2310.03714 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

TextGrad: Automatic "Differentiation" via Text

Textgrad: Automatic" differentiation" via text , author=. arXiv preprint arXiv:2406.07496 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

arXiv preprint arXiv:2502.14815 , year=

Optimizing model selection for compound ai systems , author=. arXiv preprint arXiv:2502.14815 , year=

work page arXiv
[39]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Masrouter: Learning to route llms for multi-agent systems , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[40]

arXiv preprint arXiv:2506.02951 , year=

Adaptive graph pruning for multi-agent communication , author=. arXiv preprint arXiv:2506.02951 , year=

work page arXiv
[41]

AFlow: Automating Agentic Workflow Generation

Aflow: Automating agentic workflow generation , author=. arXiv preprint arXiv:2410.10762 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Automated Design of Agentic Systems

Automated design of agentic systems , author=. arXiv preprint arXiv:2408.08435 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

arXiv preprint arXiv:2502.04180 , year=

Multi-agent architecture search via agentic supernet , author=. arXiv preprint arXiv:2502.04180 , year=

work page arXiv
[44]

arXiv preprint arXiv:2502.07373 , year=

Evoflow: Evolving diverse agentic workflows on the fly , author=. arXiv preprint arXiv:2502.07373 , year=

work page arXiv
[45]

Augmented Language Models: a Survey

Augmented language models: a survey , author=. arXiv preprint arXiv:2302.07842 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

The Twelfth International Conference on Learning Representations , year=

Gaia: a benchmark for general ai assistants , author=. The Twelfth International Conference on Learning Representations , year=
[47]

AI communications , volume=

Case-based reasoning: Foundational issues, methodological variations, and system approaches , author=. AI communications , volume=
[48]

European Workshop on Advances in Case-Based Reasoning , pages=

On the role of abstraction in case-based reasoning , author=. European Workshop on Advances in Case-Based Reasoning , pages=
[49]

International Conference on Case-Based Reasoning , pages=

Stratified case-based reasoning in non-refinable abstraction hierarchies , author=. International Conference on Case-Based Reasoning , pages=
[50]

Artificial intelligence , volume=

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning , author=. Artificial intelligence , volume=. 1999 , publisher=

1999
[51]

Artificial intelligence , volume=

Planning in a hierarchy of abstraction spaces , author=. Artificial intelligence , volume=. 1974 , publisher=

1974
[52]

If Thinking

“If Thinking” Support System for Training Historical Thinking , author=. Procedia Computer Science , volume=. 2015 , publisher=

2015
[53]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Agent-SAMA: State-Aware Mobile Assistant , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[54]

Adaptive in-conversation team building for language model agents.arXiv preprint arXiv:2405.19425, 2024

Adaptive in-conversation team building for language model agents , author=. arXiv preprint arXiv:2405.19425 , year=

work page arXiv
[55]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Assistantbench: Can web agents solve realistic and time-consuming tasks? , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[56]

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Magentic-one: A generalist multi-agent system for solving complex tasks , author=. arXiv preprint arXiv:2411.04468 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[60]

Claude Sonnet , year =
[61]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

GPT-4.1 Model Documentation , year =
[63]

2026 , howpublished =

OpenAI , title =. 2026 , howpublished =

2026
[64]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Deepseek-v3. 2: Pushing the frontier of open large language models , author=. arXiv preprint arXiv:2512.02556 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[65]

2025 , howpublished =

MiniMax M2.5: Built for Real-World Productivity , author =. 2025 , howpublished =

2025
[66]

2026 , publisher =

Anonymous , title =. 2026 , publisher =. doi:10.5281/zenodo.20060709 , url =

work page doi:10.5281/zenodo.20060709 2026

[1] [1]

Proceedings of the 42nd International Conference on Machine Learning , year =

Zhang, Shaokun and Yin, Ming and Zhang, Jieyu and Liu, Jiale and Han, Zhiguang and Zhang, Jingyang and Li, Beibin and Wang, Chi and Wang, Huazheng and Chen, Yiran and Wu, Qingyun , title =. Proceedings of the 42nd International Conference on Machine Learning , year =

[2] [2]

IEEE Transactions on software engineering , volume=

How effective developers investigate source code: An exploratory study , author=. IEEE Transactions on software engineering , volume=

[3] [3]

Proceedings of the 30th international conference on Software engineering , pages=

Debugging reinvented: asking and answering why and why not questions about program behavior , author=. Proceedings of the 30th international conference on Software engineering , pages=

[4] [4]

Transactions of the association for computational linguistics , volume=

Lost in the middle: How language models use long contexts , author=. Transactions of the association for computational linguistics , volume=

[5] [5]

arXiv preprint arXiv:2509.25370 , year=

Where llm agents fail and how they can learn from failures , author=. arXiv preprint arXiv:2509.25370 , year=

work page arXiv

[6] [6]

arXiv preprint arXiv:2505.00212 , year=

Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems , author=. arXiv preprint arXiv:2505.00212 , year=

work page arXiv

[7] [7]

ArXiv, abs/2602.23701

From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems , author=. arXiv preprint arXiv:2602.23701 , year=

work page arXiv

[8] [8]

arXiv preprint arXiv:2506.18824 , year=

Understanding software engineering agents: A study of thought-action-result trajectories , author=. arXiv preprint arXiv:2506.18824 , year=

work page arXiv

[9] [9]

ReAct: Synergizing Reasoning and Acting in Language Models

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Proceedings of the 48th international conference on Software engineering , year=

Order Matters! An Empirical Study on Large Language Models' Input Order Bias in Software Fault Localization , author=. Proceedings of the 48th international conference on Software engineering , year=

[11] Who is introducing the failure? Automatically attributing failures of multi-agent systems via spectrum analysis. arXiv preprint arXiv:2509.13782.

[12] TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use. arXiv preprint arXiv:2510.04550.

[13] AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2025.

[14] Diagnosing with Insights: Structured Analysis of Agent Failures via Behavioral Abstractions.

[15] Trial and error: Exploration-based trajectory optimization of LLM agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[16] Guided Search Strategies in Non-Serializable Environments with Applications to Software Engineering Agents. arXiv preprint arXiv:2505.13652.

[17] Metareflection: Learning instructions for language agents using past reflections. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

[18] From knowledge to noise: CTIM-rover and the pitfalls of episodic memory in software engineering agents. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), 2025.

[19] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770.

[20] UTBoost: Rigorous evaluation of coding agents on SWE-bench. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[21] Agent Workflow Memory. arXiv preprint arXiv:2409.07429.

[22] Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement. arXiv preprint arXiv:2511.05931.

[23] AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? arXiv preprint arXiv:2509.03312, 2025.

[24] Where did it all go wrong? A hierarchical look into multi-agent error attribution. arXiv preprint arXiv:2510.04886.

[25] When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution. arXiv preprint.

[26] GraphTracer: Graph-Guided Failure Tracing in LLM Agents for Robust Multi-Turn Deep Search. arXiv preprint arXiv:2510.10581.

[27] Let's Verify Step by Step. arXiv preprint arXiv:2305.20050.

[28] FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments. arXiv preprint arXiv:2604.25135.

[29] Towards Self-Improving Error Diagnosis in Multi-Agent Systems. arXiv preprint arXiv:2604.17658.

[30] MetaClaw: Just Talk--An Agent That Meta-Learns and Evolves in the Wild. arXiv preprint arXiv:2603.17187.

[31] CAMEL: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems.

[32] AutoGen: Enabling next-gen LLM applications via multi-agent conversations. First Conference on Language Modeling.

[33] MetaGPT: Meta programming for a multi-agent collaborative framework. The Twelfth International Conference on Learning Representations.

[34] ChatDev: Communicative agents for software development. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[35] GPTSwarm: Language agents as optimizable graphs. Forty-first International Conference on Machine Learning.

[36] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv preprint arXiv:2310.03714.

[37] TextGrad: Automatic "Differentiation" via Text. arXiv preprint arXiv:2406.07496.

[38] Optimizing model selection for compound AI systems. arXiv preprint arXiv:2502.14815.

[39] MasRouter: Learning to route LLMs for multi-agent systems. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[40] Adaptive graph pruning for multi-agent communication. arXiv preprint arXiv:2506.02951.

[41] AFlow: Automating Agentic Workflow Generation. arXiv preprint arXiv:2410.10762.

[42] Automated Design of Agentic Systems. arXiv preprint arXiv:2408.08435.

[43] Multi-agent architecture search via agentic supernet. arXiv preprint arXiv:2502.04180.

[44] EvoFlow: Evolving diverse agentic workflows on the fly. arXiv preprint arXiv:2502.07373.

[45] Augmented Language Models: a Survey. arXiv preprint arXiv:2302.07842.

[46] GAIA: a benchmark for general AI assistants. The Twelfth International Conference on Learning Representations.

[47] Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications.

[48] On the role of abstraction in case-based reasoning. European Workshop on Advances in Case-Based Reasoning.

[49] Stratified case-based reasoning in non-refinable abstraction hierarchies. International Conference on Case-Based Reasoning.

[50] Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

[51] Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 1974.

[52] "If Thinking" Support System for Training Historical Thinking. Procedia Computer Science, 2015.

[53] Agent-SAMA: State-Aware Mobile Assistant. Proceedings of the AAAI Conference on Artificial Intelligence.

[54] Adaptive in-conversation team building for language model agents. arXiv preprint arXiv:2405.19425, 2024.

[55] AssistantBench: Can web agents solve realistic and time-consuming tasks? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

[56] Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. arXiv preprint arXiv:2411.04468.

[57] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

[58] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

[59] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.

[60] Claude Sonnet.

[61] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[62] GPT-4.1 Model Documentation.

[63] OpenAI. 2026.

[64] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv preprint arXiv:2512.02556.

[65] MiniMax M2.5: Built for Real-World Productivity. 2025.

[66] Anonymous. 2026. doi:10.5281/zenodo.20060709.