Towards Self-Improving Error Diagnosis in Multi-Agent Systems
Pith reviewed 2026-05-10 04:44 UTC · model grok-4.3
The pith
ErrorProbe diagnoses errors in multi-agent LLM systems by tracing symptoms backward and validating hypotheses with a tool-using agent team and verified memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ErrorProbe attributes semantic failures in multi-agent systems to the responsible agent and originating step through a three-stage pipeline: operationalizing the MAS failure taxonomy to detect local anomalies, performing symptom-driven backward tracing to prune irrelevant context, and employing a specialized multi-agent team to validate error hypotheses through tool-grounded execution. It maintains a verified episodic memory that updates only when error patterns are confirmed by executable evidence, without any annotation requirement.
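Read as machinery, the core claim is a filter-then-verify loop. The sketch below renders that control flow in Python; every interface in it (the taxonomy check, backward_trace, the team and memory objects) is a hypothetical stand-in, since the paper does not describe its code at this level.

```python
# Minimal sketch of the three-stage control flow described above. All
# interfaces (taxonomy.is_anomalous, backward_trace, team.*, memory.add)
# are hypothetical stand-ins; the paper's actual API is not public.
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    agent: str            # agent deemed responsible for the failure
    step: int             # originating error step in the trace
    evidence: list = field(default_factory=list)  # confirming tool outputs

def diagnose(trace, taxonomy, backward_trace, team, memory):
    # Stage 1: operationalize the failure taxonomy to flag local anomalies.
    symptoms = [step for step in trace if taxonomy.is_anomalous(step)]
    # Stage 2: symptom-driven backward tracing prunes the context to the
    # steps that could causally explain each symptom.
    candidates = [h for s in symptoms for h in backward_trace(trace, s)]
    # Stage 3: the Strategist-Investigator-Arbiter team validates ranked
    # hypotheses with real tool executions; only confirmed patterns are
    # written to the verified episodic memory.
    for hypothesis in team.rank(candidates, memory):
        verdict = team.validate(hypothesis)  # tool-grounded check
        if verdict.confirmed:
            memory.add(hypothesis.pattern, verdict.evidence)
            return Diagnosis(hypothesis.agent, hypothesis.step, verdict.evidence)
    return None  # no hypothesis survived tool-grounded validation
```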
What carries the argument
The verified episodic memory, which stores only error patterns confirmed by tool-grounded execution from the Strategist-Investigator-Arbiter validation team.
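What "updates only when confirmed" could mean as code is worth pinning down, since the self-improvement story rides on it. A minimal sketch, assuming an evidence schema and a naive retrieval heuristic that the paper does not specify:

```python
# Sketch of the "verified" gate: a pattern is written only when backed by
# concrete tool-execution artifacts, never by an unassisted LLM judgment.
# The evidence schema and the retrieval heuristic are assumptions.
class VerifiedEpisodicMemory:
    def __init__(self):
        self._patterns = []

    def add(self, pattern: str, evidence: list) -> None:
        # Refuse updates without executable evidence; this refusal is what
        # separates the memory from ordinary LLM-as-a-judge feedback.
        if not evidence or any(e.get("tool_output") is None for e in evidence):
            raise ValueError("memory updates require tool-grounded evidence")
        self._patterns.append({"pattern": pattern, "evidence": evidence})

    def retrieve(self, symptom: str, k: int = 3) -> list:
        # Naive keyword overlap; the paper does not specify retrieval.
        scored = sorted(self._patterns,
                        key=lambda p: sum(w in p["pattern"] for w in symptom.split()),
                        reverse=True)
        return scored[:k]
```

The gate also makes the failure mode explicit: if the validating agents misread a tool output, bad evidence still passes the check, which is exactly the referee's §3.3 objection below.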
If this is right
- ErrorProbe achieves higher step-level localization accuracy than existing annotation-based or LLM-as-a-judge methods on the TracerTraj and Who&When benchmarks.
- The verified memory supports robust performance on new domains without any retraining or additional labels.
- The framework removes the need for expensive expert annotations while still producing traceable, evidence-backed diagnoses.
- Confirmed error patterns accumulate over time, enabling progressive self-improvement of the diagnostic capability itself.
Where Pith is reading between the lines
- The same tracing-plus-verified-memory pattern could be adapted to single-agent chains or tool-use failures where errors also manifest after many steps.
- Automated diagnosis of this form might be inserted into agent development loops so that each failure directly triggers targeted fixes in the next iteration.
- If the memory remains clean, the approach could scale to very long traces that current context-window methods cannot handle.
Load-bearing premise
The three-agent validation team can reliably confirm or reject error hypotheses using executable tool checks, and the memory will incorporate only genuine patterns without admitting false positives or biases.
What would settle it
On the Who&When benchmark, run ErrorProbe with the verified memory disabled versus enabled, and check whether enabling memory updates lifts step-level localization accuracy and cross-domain transfer above the best baseline, or whether active updates instead degrade performance below it.
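The ablation itself is cheap to specify even though the diagnoser is not. A hedged sketch, with run_errorprobe, the benchmark object, and the scorer as hypothetical placeholders:

```python
# Hedged sketch of the proposed ablation: run the identical diagnoser with
# the verified memory off and on, score both conditions, and compare each
# against the best baseline. run_errorprobe and benchmark are hypothetical.
def ablate_memory(benchmark, run_errorprobe, score_fn):
    results = {}
    for use_memory in (False, True):
        preds = [run_errorprobe(trace, use_memory=use_memory)
                 for trace in benchmark.traces]
        key = "memory_on" if use_memory else "memory_off"
        results[key] = score_fn(preds, benchmark.gold)
    return results  # dict mapping condition to step-level accuracy
```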
Original abstract
Large Language Model (LLM)-based Multi-Agent Systems (MAS) enable complex problem-solving but introduce significant debugging challenges, characterized by long interaction traces, inter-agent dependencies, and delayed error manifestation. Existing diagnostic approaches often rely on expensive expert annotation or "LLM-as-a-judge" paradigms, which struggle to pinpoint decisive error steps within extended contexts. In this paper, we introduce ErrorProbe, a self-improving framework for semantic failure attribution that identifies responsible agents and the originating error step. The framework operates via a three-stage pipeline: (1) operationalizing the MAS failure taxonomy to detect local anomalies, (2) performing symptom-driven backward tracing to prune irrelevant context, and (3) employing a specialized multi-agent team (Strategist, Investigator, Arbiter) to validate error hypotheses through tool-grounded execution. Crucially, ErrorProbe maintains a verified episodic memory that updates only when error patterns are confirmed by executable evidence, without the need for annotation. Experiments across the TracerTraj and Who&When benchmarks demonstrate that ErrorProbe significantly outperforms baselines, particularly in step-level localization, while the verified memory enables robust cross-domain transfer without retraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ErrorProbe, a self-improving framework for semantic failure attribution in LLM-based multi-agent systems. It operates via a three-stage pipeline: operationalizing a MAS failure taxonomy to detect local anomalies, symptom-driven backward tracing to prune context, and a specialized multi-agent team (Strategist, Investigator, Arbiter) to validate error hypotheses through tool-grounded execution. A verified episodic memory updates only on patterns confirmed by executable evidence, enabling self-improvement and robust cross-domain transfer without retraining or annotation. Experiments on the TracerTraj and Who&When benchmarks are claimed to demonstrate significant outperformance over baselines, particularly in step-level localization.
Significance. If the reliability of the verification mechanism and the reported empirical gains hold, the work offers a promising direction for automated debugging of complex MAS by reducing dependence on expert labels and enabling memory-based transfer. The emphasis on tool-grounded confirmation and episodic memory updates is a concrete strength that could support reproducible extensions in the field.
major comments (2)
- §3.3 (Multi-agent validation team): The central self-improving claim depends on the episodic memory updating 'only when error patterns are confirmed by executable evidence' via the Strategist-Investigator-Arbiter team. Because confirmation occurs through LLM-driven hypothesis validation and tool calls rather than an external oracle, the mechanism risks false positives from the same class of LLM judgment errors the system aims to diagnose; no explicit mechanism for detecting or bounding such validation errors is described, directly threatening both the 'verified' property and the no-retraining cross-domain transfer.
- §4 (Experiments and benchmarks): The claims of significant outperformance and cross-domain transfer rest on results from TracerTraj and Who&When, yet the manuscript supplies insufficient detail on baseline implementations, statistical tests, error bars, or how the benchmarks were constructed and how step-level localization was scored. This weakens support for the load-bearing assertion that the verified memory enables robust transfer.
minor comments (2)
- Abstract: the phrase 'significantly outperforms' would benefit from one or two key quantitative metrics to convey the magnitude of improvement immediately.
- §3: Ensure consistent terminology for 'verified episodic memory' and 'tool-grounded execution' across sections to avoid minor ambiguity in the pipeline description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on ErrorProbe. The comments highlight important aspects of the verification process and experimental reporting. We respond point-by-point below, clarifying the role of tool-grounded evidence and committing to expanded details where needed.
Point-by-point responses
- Referee: §3.3 (Multi-agent validation team): The central self-improving claim depends on the episodic memory updating 'only when error patterns are confirmed by executable evidence' via the Strategist-Investigator-Arbiter team. Because confirmation occurs through LLM-driven hypothesis validation and tool calls rather than an external oracle, the mechanism risks false positives from the same class of LLM judgment errors the system aims to diagnose; no explicit mechanism for detecting or bounding such validation errors is described, directly threatening both the 'verified' property and the no-retraining cross-domain transfer.
  Authors: The validation in the Strategist-Investigator-Arbiter team is deliberately anchored in executable tool outputs rather than standalone LLM judgments. The Investigator issues tool calls whose results (e.g., concrete return values, error codes, or state changes) constitute external, verifiable evidence; the Arbiter then evaluates these concrete artifacts against the hypothesis. This design distinguishes the process from pure LLM-as-a-judge approaches. Nevertheless, we acknowledge that LLM interpretation of tool outputs could introduce secondary errors. To address this, we will revise §3.3 to describe an explicit bounding step: the Arbiter requires at least two independent tool executions for confirmation and logs any interpretive discrepancies for later review (see the first sketch after these responses). This addition will reinforce the 'verified' property without altering the core pipeline. Revision: yes.
- Referee: §4 (Experiments and benchmarks): The claims of significant outperformance and cross-domain transfer rest on results from TracerTraj and Who&When, yet the manuscript supplies insufficient detail on baseline implementations, statistical tests, error bars, or how the benchmarks were constructed and how step-level localization was scored. This weakens support for the load-bearing assertion that the verified memory enables robust transfer.
  Authors: We agree that additional experimental transparency is required to substantiate the performance and transfer claims. In the revised manuscript we will expand §4 with: (i) complete baseline implementation details, including prompting strategies and any public code references; (ii) results of statistical significance tests (paired t-tests across runs) together with error bars from five independent trials; (iii) explicit descriptions of benchmark construction, including trace generation procedures and ground-truth step annotations; and (iv) the precise scoring protocol for step-level localization (exact-match accuracy with tolerance for adjacent steps; see the second sketch after these responses). These revisions will directly support the assertion that verified episodic memory enables cross-domain transfer. Revision: yes.
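The bounding step promised in the first response is concrete enough to sketch. The agreement threshold and the discrepancy logging below are one reading of that promise, not the published method:

```python
# Sketch of the confirmation gate from the first response: the Arbiter
# accepts a hypothesis only if n independent tool executions all support
# it, and logs mixed evidence for later review. All names are hypothetical.
import logging

log = logging.getLogger("arbiter")

def arbiter_confirms(hypothesis, run_tool_check, n_required: int = 2) -> bool:
    outcomes = [run_tool_check(hypothesis) for _ in range(n_required)]
    supported = [o.supports_hypothesis for o in outcomes]
    if all(supported):
        return True  # unanimous tool evidence: safe to write to memory
    if any(supported):
        # Interpretive discrepancy: reject the update but keep the record.
        log.warning("discrepant tool evidence for %r: %s", hypothesis, supported)
    return False
```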
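The scoring protocol named in item (iv) of the second response also admits a direct reading; whether adjacent steps earn full credit is an assumption here:

```python
# Sketch of step-level localization scoring: exact match on the responsible
# agent, with a configurable tolerance window on the step index. Crediting
# adjacent steps at full weight is an assumed interpretation.
def localization_accuracy(predictions, gold, step_tolerance: int = 0) -> float:
    hits = 0
    for pred, ref in zip(predictions, gold):
        agent_ok = pred["agent"] == ref["agent"]
        step_ok = abs(pred["step"] - ref["step"]) <= step_tolerance
        hits += agent_ok and step_ok
    return hits / len(gold)
```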
Circularity Check
No circularity: framework and claims are self-contained via external benchmarks and tool execution
Full rationale
The paper introduces ErrorProbe as a three-stage pipeline (taxonomy operationalization, backward tracing, multi-agent hypothesis validation via tool-grounded execution) plus verified episodic memory updated only on confirmed executable evidence. No equations or derivations are present. The central claims rest on benchmark comparisons (TracerTraj, Who&When) and external tool calls rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The memory update rule is defined in terms of independent tool execution, not in terms of the system's own outputs or prior results. This satisfies the criteria for a self-contained empirical framework with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: MAS failures can be localized to specific agents and originating steps through anomaly detection and symptom-driven tracing.
- Domain assumption: Tool-grounded execution by the Strategist-Investigator-Arbiter team can confirm or refute error hypotheses without external annotation.
invented entities (3)
- ErrorProbe framework: no independent evidence
- verified episodic memory: no independent evidence
- Strategist-Investigator-Arbiter multi-agent team: no independent evidence