StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

Gang Huang; Taiyu Zhu; Weilin Jin; Yifan Wu; Ying Li

arxiv: 2606.03467 · v1 · pith:TFQJW44Tnew · submitted 2026-06-02 · 💻 cs.AI

Pith reviewed 2026-06-28 10:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords failure attribution · multi-agent systems · temporal semantic sequences · root cause identification · step-level error scoring · LLM efficiency · execution trajectory analysis

The pith

StepFinder attributes root cause steps in multi-agent failures by encoding logs into temporal sequences and applying lightweight modeling instead of full LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM-based multi-agent systems are highly sensitive to single-step errors that cascade through agent interactions. Existing approaches ask LLMs to read entire noisy execution logs and identify the faulty step, which is slow and easily distracted by irrelevant text. StepFinder instead confines LLM use to feature construction, where logs are encoded into compact temporal semantic sequences, and then runs lightweight temporal modeling and attention modules to score each step's contribution to the failure. The score is refined using multi-scale differences and position bias to point to the root cause step. On the Who&When benchmark, the authors report higher step-level accuracy than LLM-based methods, a 79% reduction in inference time relative to the fastest LLM-based method, and no text generation overhead.

Core claim

StepFinder encodes execution logs into temporal semantic sequences using LLMs only during feature construction, then applies a parameter-efficient combination of temporal modeling and attention modules to capture sequential evolution and cross-step dependencies, and finally refines step-level error scores through multi-scale differences and position bias to identify the root cause step.
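
The first stage of this claim, encoding a log once and reusing the result, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Step`, `toy_embed`, and `encode_trajectory` are hypothetical, and `toy_embed` is a hash-based stand-in that carries no semantics. A real pipeline would pass an LLM-backed embedding function in its place.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One entry of an execution log (hypothetical schema)."""
    index: int    # position of the step in the trajectory
    agent: str    # name of the agent that produced the step
    content: str  # raw log text for the step


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in for an LLM embedding call: deterministic, hash-based, no semantics."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def encode_trajectory(
    steps: List[Step],
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[List[float]]:
    """Turn a log into a time-ordered sequence of per-step vectors."""
    ordered = sorted(steps, key=lambda step: step.index)
    return [embed(f"{step.agent}: {step.content}") for step in ordered]
```

Because the encoder is called once per step and the output is a plain list of vectors, any later scoring stage can run without touching the LLM again, which is the separation the core claim relies on.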

What carries the argument

Temporal semantic sequences produced by LLM encoding of execution logs, processed by temporal modeling plus attention modules and refined by multi-scale differences and position bias.
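
The refinement idea named here can be made concrete with a small sketch. The page does not give the paper's formulas, so every specific choice below is an assumption made for illustration: the function names, the scales `(1, 2, 4)`, the use of Euclidean distance, and a linear prior that slightly favors earlier steps are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two step vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def multi_scale_difference(seq: List[Vector], t: int, scales=(1, 2, 4)) -> float:
    """Mean distance between step t and the steps s positions earlier (assumed form)."""
    dists = [l2_distance(seq[t], seq[t - s]) for s in scales if t - s >= 0]
    return sum(dists) / len(dists) if dists else 0.0


def refine_scores(base_scores, seq, diff_weight=1.0, bias_weight=0.1):
    """Add a multi-scale difference term and a position prior to each base score."""
    n = len(seq)
    refined = []
    for t in range(n):
        diff = multi_scale_difference(seq, t)
        # Assumed prior: linearly decreasing, so earlier steps get a small boost.
        position_bias = bias_weight * (1.0 - t / max(n - 1, 1))
        refined.append(base_scores[t] + diff_weight * diff + position_bias)
    return refined


def predict_root_cause(base_scores, seq) -> int:
    """Return the index of the step with the highest refined score."""
    refined = refine_scores(base_scores, seq)
    return max(range(len(refined)), key=refined.__getitem__)
```

In this toy form, a step whose representation jumps away from its recent predecessors receives a higher score, which is one plausible reading of how multi-scale differences could localize the point where a trajectory goes wrong.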

If this is right

Step-level failure attribution becomes accurate enough to guide targeted fixes in multi-agent workflows.
Inference cost drops sharply because only the initial encoding uses an LLM and no text is generated at runtime.
The same pipeline can be applied to any multi-step agent trajectory without retraining large models.
Position bias and multi-scale refinement together reduce false positives from early or late steps in long sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of encoding from scoring suggests similar hybrid designs could speed up other log-analysis tasks that currently call LLMs repeatedly.
If the temporal sequences prove robust across domains, the approach could be tested on non-LLM agent systems that still produce timestamped execution traces.
Extending the refinement step to include agent-interaction graphs might further improve attribution when failures involve coordination rather than single steps.

Load-bearing premise

Encoding execution logs into temporal semantic sequences with LLMs preserves enough information for the later modules to locate the true root cause without the noise that affects raw logs.

What would settle it

A controlled test on the Who&When benchmark in which the LLM encoding step drops the critical signal about the actual failing step, causing StepFinder to return the wrong root cause while raw-log LLM methods still succeed.

Figures

Figures reproduced from arXiv: 2606.03467 by Gang Huang, Taiyu Zhu, Weilin Jin, Yifan Wu, Ying Li.

**Figure 2.** Overview of StepFinder. Given a failure trajectory, execution logs are first encoded into temporal semantic sequences

**Figure 3.** Comparison of tolerance accuracy (%) (±𝛿 steps) on Who&When. Agent-Aware Step Interaction Module, the projection dimension for each attention head is set to 32, with 2 attention heads in total. The model is trained for up to 50 epochs with a batch size of 16 using the AdamW optimizer and a weight decay of 1e-5. Gradient clipping with a threshold of 1.0 and early stopping with a patience of 10 epochs are ap…

**Figure 4.** Comparison of MRR@3 (%) on Who&When. These results highlight the inherent limitations of LLM baselines in long-horizon causal reasoning for multi-agent trajectories. 3) Comparison with Sequential Models. Traditional sequential models generally outperform LLM-based methods, validating our use of dedicated sequential architectures for step-level fault attribution. However, StepFinder maintains a significant…

**Figure 5.** Sensitivity analysis of key hyperparameters on Who&When (Accuracy %). The gray shaded areas indicate the robust
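
The two metrics named in the captions of Figures 3 and 4 can be sketched as follows. The definitions are assumptions based on the usual meaning of the terms, since the paper's exact definitions are not reproduced on this page: tolerance accuracy is taken to count a prediction as correct when it falls within δ steps of the labeled root cause, and MRR@3 is taken to be the mean reciprocal rank of the labeled step among the three highest-scored steps. The function names are hypothetical.

```python
from typing import List, Sequence


def tolerance_accuracy(predicted: Sequence[int], actual: Sequence[int], delta: int = 0) -> float:
    """Percentage of trajectories whose predicted step is within delta of the label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= delta)
    return 100.0 * hits / len(predicted)


def mrr_at_k(step_scores: List[Sequence[float]], true_steps: Sequence[int], k: int = 3) -> float:
    """Mean reciprocal rank (in percent) of the labeled step within the top-k scored steps."""
    if len(step_scores) != len(true_steps):
        raise ValueError("step_scores and true_steps must have the same length")
    if not true_steps:
        return 0.0
    total = 0.0
    for scores, true_step in zip(step_scores, true_steps):
        # Rank step indices by descending score and keep the top k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        if true_step in ranked:
            total += 1.0 / (ranked.index(true_step) + 1)
    return 100.0 * total / len(true_steps)
```

With δ = 0, tolerance accuracy reduces to exact step-level accuracy, so the curve over δ shows how close the misses are rather than only how many there are.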

Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.
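
As a quick check on what the 79% figure implies, the reduction can be converted into a speedup factor. The symbols t_LLM (inference time of the fastest LLM-based method) and t_SF (StepFinder's inference time) are introduced here for illustration only and do not appear in the abstract.

```latex
\[
t_{\mathrm{SF}} = (1 - 0.79)\, t_{\mathrm{LLM}} = 0.21\, t_{\mathrm{LLM}}
\quad\Longrightarrow\quad
\frac{t_{\mathrm{LLM}}}{t_{\mathrm{SF}}} = \frac{1}{0.21} \approx 4.76
\]
```

If the reported reduction holds, StepFinder would run roughly 4.8 times faster than that baseline. The abstract does not state absolute times, so this ratio is the only quantity that can be derived from it.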

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

StepFinder confines LLM use to log encoding, then applies lightweight temporal modeling plus refinement for failure attribution; the reported 79% inference-time cut needs experimental details to evaluate.

The paper's main move is to restrict LLMs to a one-time encoding step that turns execution logs into temporal semantic sequences, then feed those into parameter-efficient temporal modeling, attention, and a multi-scale difference plus position bias refinement to score which step caused the failure. This avoids running full LLM reasoning over entire noisy trajectories at inference time.

The separation itself is the clearest new piece. It directly targets the interference problem mentioned in the abstract and keeps the heavy model out of the loop after feature construction. The refinement step adds a concrete way to sharpen step-level scores without extra generation.

The reported results on the Who&When benchmark claim better attribution accuracy than LLM baselines plus the 79% inference time reduction with zero generation overhead. Those numbers are the main selling point for anyone dealing with multi-agent reliability.

The soft spot is the lack of any experimental detail in the abstract: no baseline descriptions, no statistical tests, no ablations on the encoding step, and no information on dataset splits or variance. Without those, the performance claim cannot be checked. The load-bearing assumption is that the LLM encoding preserves the causal dependencies needed to identify the true root cause; if it over-summarizes or introduces its own distortions, the downstream modules have no way to recover. The paper presents the pipeline as solving the noise issue, but the abstract supplies no evidence that the encoding step was validated for fidelity.

This is for people working on debugging and reliability in multi-agent LLM setups. A reader looking for modular efficiency tricks could extract the structure even if the numbers need confirmation. The thinking is coherent on its own terms.

Send it for peer review so the experimental controls and encoding validation can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper proposes StepFinder, a lightweight failure attribution framework for LLM-based multi-agent systems. LLMs are used solely in the feature construction phase to encode execution logs into temporal semantic sequences; a parameter-efficient combination of temporal modeling and attention modules then captures sequential evolution and cross-step dependencies, followed by multi-scale difference and position-bias refinement to produce step-level error scores. On the Who&When benchmark the method is reported to outperform direct LLM-based attribution while reducing inference time by 79% relative to the fastest LLM baseline and incurring no text-generation overhead.

Significance. If the empirical claims are substantiated, the work offers a practical route to accurate, low-latency failure attribution that avoids the cost and noise sensitivity of end-to-end LLM reasoning over raw trajectories. The public release of code is a clear reproducibility asset.

major comments (2)

[Abstract] The central performance claim (outperformance plus 79% time reduction) is presented without any description of the LLM baselines, statistical significance tests, error bars, train/test splits, or ablation controls; these omissions make the reported gains impossible to evaluate from the given information.
[§3, feature construction] The method's accuracy advantage rests on the untested premise that LLM-generated temporal semantic sequences retain sufficient causal signal and do not introduce new distortions (over-generalization or hallucinated relations) relative to raw logs; no ablation comparing encoded sequences against raw-log inputs or alternative encodings is described, leaving the load-bearing assumption unsupported.
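
One way the requested error bars could be produced is a paired bootstrap over test trajectories. The sketch below is an illustration of that general technique, not a procedure taken from the paper; the function name, the 0/1 correctness encoding, and the defaults (2000 resamples, 95% interval, fixed seed) are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the accuracy difference (A minus B).

    correct_a[i] and correct_b[i] are 1 if the method attributed trajectory i
    correctly and 0 otherwise; both methods are scored on the same trajectories.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample trajectory indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper
```

An interval that excludes zero would support the claim that one method is more accurate than the other on the benchmark; an interval that straddles zero would indicate the gap could be due to sampling noise.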

minor comments (1)

[Abstract] The benchmark name 'Who&When' should be accompanied by a citation or brief description of its construction and size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying the experimental details and the rationale for our feature construction approach while noting where revisions will strengthen the presentation.

Referee: [Abstract] The central performance claim (outperformance plus 79% time reduction) is presented without any description of the LLM baselines, statistical significance tests, error bars, train/test splits, or ablation controls; these omissions make the reported gains impossible to evaluate from the given information.

Authors: We agree that the abstract's brevity omits key experimental context. The LLM baselines (including model variants and prompting strategies) are fully specified in Section 4.2, statistical significance tests with error bars appear in the results tables of Section 4.3, train/test splits are detailed in Section 4.1, and ablation controls are reported in Section 5. To improve evaluability directly from the abstract, we will revise it to briefly identify the primary LLM baselines and note that full methodological details, including significance testing and splits, are provided in the experimental sections.

revision: yes

Referee: [§3, feature construction] The method's accuracy advantage rests on the untested premise that LLM-generated temporal semantic sequences retain sufficient causal signal and do not introduce new distortions (over-generalization or hallucinated relations) relative to raw logs; no ablation comparing encoded sequences against raw-log inputs or alternative encodings is described, leaving the load-bearing assumption unsupported.

Authors: Section 3 motivates the temporal semantic encoding as a means to extract high-level causal relations from noisy, redundant logs while preserving sequential structure, with the subsequent temporal modeling and attention modules operating on these sequences. Although a direct ablation against raw-log inputs or alternative encodings is not reported, the end-to-end results on the Who&When benchmark demonstrate that StepFinder outperforms direct LLM attribution methods that reason over raw trajectories, providing indirect support that the encoding retains necessary signal without prohibitive distortion. We will revise Section 3 to include an expanded discussion of this design assumption, its potential limitations, and the indirect evidence from the main experiments.

revision: partial

Circularity Check

0 steps flagged

No circularity: modular pipeline with experimental validation

The paper presents StepFinder as an engineering pipeline (LLM encoding of logs into temporal semantic sequences, followed by temporal modeling + attention + multi-scale refinement) whose performance claims rest on benchmark experiments rather than any derivation, equation, or fitted parameter that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the method is described as a practical combination of existing components without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that the temporal semantic encoding step is information-preserving.

Reference graph

Works this paper leans on

58 extracted references · 12 linked inside Pith

[1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normaliza- tion.arXiv preprint arXiv:1607.06450(2016)

Pith/arXiv arXiv 2016
[2]

Adi Banerjee, Anirudh Nair, and Tarik Borogovac. 2025. Where did it all go wrong? A hierarchical look into multi-agent error attribution.arXiv preprint arXiv:2510.04886(2025)

arXiv 2025
[3]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

2020
[4]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ram- chandran, et al . 2025. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657(2025)

Pith/arXiv arXiv 2025
[5]

Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kan- nappan, and Rebecca Qian. 2025. TRAIL: Trace Reasoning and Agentic Issue Localization.arXiv preprint arXiv:2505.08638(2025)

arXiv 2025
[6]

Wei Du, Shifei Ding, Lili Guo, Jian Zhang, and Ling Ding. 2024. Expressive multi-agent communication via identity-aware learning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17354–17361

2024
[7]

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-act: Im- proving planning of agents for long-horizon tasks.arXiv preprint arXiv:2503.09572 (2025)

Pith/arXiv arXiv 2025
[8]

Mingyan Gao, Yanzi Li, Banruo Liu, Yifan Yu, Phillip Wang, Ching-Yu Lin, and Fan Lai. 2025. Single-agent or Multi-agent Systems? Why Not Both?arXiv preprint arXiv:2505.18286(2025)

arXiv 2025
[9]

Yu Ge, Linna Xie, Zhong Li, Yu Pei, and Tian Zhang. 2025. Who is introducing the failure? automatically attributing failures of multi-agent systems via spectrum analysis.arXiv preprint arXiv:2509.13782(2025)

arXiv 2025
[10]

Alireza Ghafarollahi and Markus J Buehler. 2025. SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials37, 22 (2025), 2413523

2025
[11]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory.Neural computation9, 8 (1997), 1735–1780

1997
[12]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al
[13]

InThe twelfth international conference on learning representations

MetaGPT: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations
[14]

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model context protocol (mcp): Landscape, security threats, and future research directions.arXiv preprint arXiv:2503.23278(2025)

Pith/arXiv arXiv 2025
[15]

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. 2024. On the resilience of llm-based multi-agent collaboration with faulty agents.arXiv preprint arXiv:2408.00989(2024)

arXiv 2024
[16]

Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. 2022. Root cause analysis of failures in microservices through causal discovery.Advances in Neural Information Processing Systems35 (2022), 31158–31170

2022
[17]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770(2023)

Pith/arXiv arXiv 2023
[18]

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. 2025. Aegis: Automated Error Generation and Attribution for Multi-Agent Systems. arXiv preprint arXiv:2509.14295(2025)

Pith/arXiv arXiv 2025
[19]

Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, and Wei Han. 2025. Amas: Adaptively determining communication topology for llm-based multi-agent system. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2061–2070

2025
[20]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for" mind" exploration of large language model society.Advances in Neural Information Processing Systems36 (2023), 51991–52008

2023
[21]

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. 2025. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation.arXiv preprint arXiv:2507.18224(2025)

arXiv 2025
[22]

Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, and Wen-Chih Peng. 2024. Root cause analysis in microservice using neural granger causal discovery. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 206–213

2024
[23]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics12 (2024), 157–173

2024
[24]

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization.arXiv preprint arXiv:2310.02170(2023)

Pith/arXiv arXiv 2023
[25]

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha
[26]

arXiv preprint arXiv:2408.06292(2024)

The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292(2024)

Pith/arXiv arXiv 2024
[27]

Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, and Yidan Liang. 2025. Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference.arXiv preprint arXiv:2509.08682(2025)

arXiv 2025
[28]

Joshua Owotogbe. 2025. Assessing and Enhancing the Robustness of LLM- based Multi-Agent Systems Through Chaos Engineering. In2025 IEEE/ACM 4th International Conference on AI Engineering–Software Engineering for AI (CAIN). IEEE, 250–252

2025
[29]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22

2023
[30]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. Chatdev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15174–15186

2024
[31]

Xihe Qiu, Haoyu Wang, Xiaoyu Tan, Chao Qu, Yujie Xiong, Yuan Cheng, Yinghui Xu, Wei Chu, and Yuan Qi. 2024. Towards collaborative intelligence: Propagat- ing intentions and reasoning for multi-agent coordination with large language models.arXiv preprint arXiv:2407.12532(2024)

arXiv 2024
[32]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems36 (2023), 68539–68551

2023
[33]

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. 2025. Agent laboratory: Using llm agents as research assistants.arXiv preprint arXiv:2501.04227(2025)

Pith/arXiv arXiv 2025
[34]

Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, and Xin Wang. 2025. Understanding the Information Propagation Effects of Com- munication Topologies in LLM-based Multi-Agent Systems.arXiv preprint arXiv:2505.23352(2025)

arXiv 2025
[35]

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kamb- hampati. 2023. On the planning abilities of large language models-a critical investigation.Advances in Neural Information Processing Systems36 (2023), 75993–76005

2023
[36]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291(2023)

Pith/arXiv arXiv 2023
[37]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171(2022)

Pith/arXiv arXiv 2022
[38]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

2022
[39]

Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Zhiyuan Ning, and Yue Zhang
[40]

Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems.arXiv preprint arXiv:2509.10401(2025)

arXiv 2025
[41]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. InFirst Conference on Language Modeling

2024
[42]

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. 2025. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems.arXiv preprint arXiv:2504.00587(2025)

arXiv 2025
[43]

Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. 2024. Embodied multi-modal agent trained by an llm from a parallel textworld. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26275–26285

2024
[44]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems36 (2023), 11809–11822

2023
[45]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations

2022
[46]

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, and Yang Zhang. 2025. Breaking agents: Compromising autonomous llm agents through malfunction amplification. InProceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing. 34952–34964

2025
[47]

Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang
[48]

Multi-agent architecture search via agentic supernet.arXiv preprint StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea arXiv:2502.04180(2025)

arXiv 2026
[49]

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. 2025. AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?arXiv preprint arXiv:2509.03312(2025)

arXiv 2025
[50]

Jieyu Zhang, Ranjay Krishna, Ahmed H Awadallah, and Chi Wang. 2023. Ecoas- sistant: Using llm assistant more affordably and accurately.arXiv preprint arXiv:2310.03046(2023)

arXiv 2023
[51]

Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al . 2025. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems.arXiv preprint arXiv:2505.00212(2025)

arXiv 2025
[52]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.arXiv preprint arXiv:2506.05176(2025)

Pith/arXiv arXiv 2025
[53]

Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Zhen Wang, and Xuelong Li. 2025. Towards efficient llm grounding for embodied multi-agent collaboration. InFindings of the Association for Computational Linguistics: ACL 2025. 1663–1699

2025
[54]

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models.arXiv preprint arXiv:2304.09797(2023)

arXiv 2023
[55]

Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö Arık. 2025. Multi-agent design: Optimizing agents with better prompts and topologies.arXiv preprint arXiv:2502.02533(2025)

arXiv 2025
[56]

Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Charlotte Tang, Youb- ing Yin, Nathan Wolfe, Erin Babinsky, and Daben Liu. 2025. RAFFLES: Reasoning- based Attribution of Faults for LLM Systems.arXiv preprint arXiv:2509.06822 (2025)

arXiv 2025
[57]

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al . 2025. Where llm agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370 (2025)

arXiv 2025
[58]

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoor- thi, Yuandong Tian, et al. 2024. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934(2024). A Pseudocode of StepFinder Algorithm 1StepFinder Training Input:MAS failure logsT={𝜏 1, 𝜏2, . ....

arXiv 2024

[1] [1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normaliza- tion.arXiv preprint arXiv:1607.06450(2016)

Pith/arXiv arXiv 2016

[2] [2]

Adi Banerjee, Anirudh Nair, and Tarik Borogovac. 2025. Where did it all go wrong? A hierarchical look into multi-agent error attribution.arXiv preprint arXiv:2510.04886(2025)

arXiv 2025

[3] [3]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

2020

[4] [4]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ram- chandran, et al . 2025. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657(2025)

Pith/arXiv arXiv 2025

[5] [5]

Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kan- nappan, and Rebecca Qian. 2025. TRAIL: Trace Reasoning and Agentic Issue Localization.arXiv preprint arXiv:2505.08638(2025)

arXiv 2025

[6] [6]

Wei Du, Shifei Ding, Lili Guo, Jian Zhang, and Ling Ding. 2024. Expressive multi-agent communication via identity-aware learning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 17354–17361

2024

[7] [7]

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. 2025. Plan-and-act: Im- proving planning of agents for long-horizon tasks.arXiv preprint arXiv:2503.09572 (2025)

Pith/arXiv arXiv 2025

[8] [8]

Mingyan Gao, Yanzi Li, Banruo Liu, Yifan Yu, Phillip Wang, Ching-Yu Lin, and Fan Lai. 2025. Single-agent or Multi-agent Systems? Why Not Both?arXiv preprint arXiv:2505.18286(2025)

arXiv 2025

[9] [9]

Yu Ge, Linna Xie, Zhong Li, Yu Pei, and Tian Zhang. 2025. Who is introducing the failure? automatically attributing failures of multi-agent systems via spectrum analysis.arXiv preprint arXiv:2509.13782(2025)

arXiv 2025

[10] [10]

Alireza Ghafarollahi and Markus J Buehler. 2025. SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning.Advanced Materials37, 22 (2025), 2413523

2025

[11] [11]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory.Neural computation9, 8 (1997), 1735–1780

1997

[12] [12]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al

[13] [13]

InThe twelfth international conference on learning representations

MetaGPT: Meta programming for a multi-agent collaborative framework. InThe twelfth international conference on learning representations

[14] [14]

Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model context protocol (mcp): Landscape, security threats, and future research directions.arXiv preprint arXiv:2503.23278(2025)

Pith/arXiv arXiv 2025

Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael R Lyu, and Maarten Sap. 2024. On the resilience of llm-based multi-agent collaboration with faulty agents. arXiv preprint arXiv:2408.00989 (2024).

Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. 2022. Root cause analysis of failures in microservices through causal discovery. Advances in Neural Information Processing Systems 35 (2022), 31158–31170.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023).

Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. 2025. Aegis: Automated Error Generation and Attribution for Multi-Agent Systems. arXiv preprint arXiv:2509.14295 (2025).

Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, and Wei Han. 2025. Amas: Adaptively determining communication topology for llm-based multi-agent system. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2061–2070.

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems 36 (2023), 51991–52008.

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. 2025. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation. arXiv preprint arXiv:2507.18224 (2025).

Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, and Wen-Chih Peng. 2024. Root cause analysis in microservice using neural granger causal discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 206–213.

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics 12 (2024), 157–173.

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2023. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170 (2023).

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292 (2024).

Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, and Yidan Liang. 2025. Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference. arXiv preprint arXiv:2509.08682 (2025).

Joshua Owotogbe. 2025. Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering. In 2025 IEEE/ACM 4th International Conference on AI Engineering–Software Engineering for AI (CAIN). IEEE, 250–252.

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology. 1–22.

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15174–15186.

Xihe Qiu, Haoyu Wang, Xiaoyu Tan, Chao Qu, Yujie Xiong, Yuan Cheng, Yinghui Xu, Wei Chu, and Yuan Qi. 2024. Towards collaborative intelligence: Propagating intentions and reasoning for multi-agent coordination with large language models. arXiv preprint arXiv:2407.12532 (2024).

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36 (2023), 68539–68551.

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. 2025. Agent laboratory: Using llm agents as research assistants. arXiv preprint arXiv:2501.04227 (2025).

Xu Shen, Yixin Liu, Yiwei Dai, Yili Wang, Rui Miao, Yue Tan, Shirui Pan, and Xin Wang. 2025. Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems. arXiv preprint arXiv:2505.23352 (2025).

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. On the planning abilities of large language models-a critical investigation. Advances in Neural Information Processing Systems 36 (2023), 75993–76005.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023).

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.

Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Zhiyuan Ning, and Yue Zhang. Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems. arXiv preprint arXiv:2509.10401 (2025).

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. 2025. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems. arXiv preprint arXiv:2504.00587 (2025).

Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, and Yuhui Shi. 2024. Embodied multi-modal agent trained by an llm from a parallel textworld. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26275–26285.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems 36 (2023), 11809–11822.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations.

Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, and Yang Zhang. 2025. Breaking agents: Compromising autonomous llm agents through malfunction amplification. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 34952–34964.

Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, and Xiang Wang. Multi-agent architecture search via agentic supernet. arXiv preprint arXiv:2502.04180 (2025).

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems
KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea

Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. 2025. AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? arXiv preprint arXiv:2509.03312 (2025).

Jieyu Zhang, Ranjay Krishna, Ahmed H Awadallah, and Chi Wang. 2023. Ecoassistant: Using llm assistant more affordably and accurately. arXiv preprint arXiv:2310.03046 (2023).

Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. 2025. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems. arXiv preprint arXiv:2505.00212 (2025).

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025).

Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Zhen Wang, and Xuelong Li. 2025. Towards efficient llm grounding for embodied multi-agent collaboration. In Findings of the Association for Computational Linguistics: ACL 2025. 1663–1699.

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797 (2023).

Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö Arık. 2025. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533 (2025).

Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Charlotte Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, and Daben Liu. 2025. RAFFLES: Reasoning-based Attribution of Faults for LLM Systems. arXiv preprint arXiv:2509.06822 (2025).

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. 2025. Where llm agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370 (2025).

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. 2024. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934 (2024).

A Pseudocode of StepFinder

Algorithm 1 StepFinder Training

Input: MAS failure logs T = {𝜏_1, 𝜏_2, . ....