Towards Self-Improving Error Diagnosis in Multi-Agent Systems
Pith reviewed 2026-05-10 04:44 UTC · model grok-4.3
The pith
ErrorProbe diagnoses errors in multi-agent LLM systems by tracing symptoms backward and validating hypotheses with a tool-using agent team and verified memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ErrorProbe attributes semantic failures in multi-agent systems to the responsible agent and originating step through a three-stage pipeline: operationalizing the MAS failure taxonomy to detect local anomalies, performing symptom-driven backward tracing to prune irrelevant context, and employing a specialized multi-agent team to validate error hypotheses through tool-grounded execution. It maintains a verified episodic memory that updates only when error patterns are confirmed by executable evidence, without any annotation requirement.
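Read as machinery, the core claim is a filter-then-verify loop. The sketch below renders that control flow in Python; every interface in it (the taxonomy check, backward_trace, the team and memory objects) is a hypothetical stand-in, since the paper does not describe its code at this level.

```python
# Minimal sketch of the three-stage control flow described above. All
# interfaces (taxonomy.is_anomalous, backward_trace, team.*, memory.add)
# are hypothetical stand-ins; the paper's actual API is not public.
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    agent: str            # agent deemed responsible for the failure
    step: int             # originating error step in the trace
    evidence: list = field(default_factory=list)  # confirming tool outputs

def diagnose(trace, taxonomy, backward_trace, team, memory):
    # Stage 1: operationalize the failure taxonomy to flag local anomalies.
    symptoms = [step for step in trace if taxonomy.is_anomalous(step)]
    # Stage 2: symptom-driven backward tracing prunes the context to the
    # steps that could causally explain each symptom.
    candidates = [h for s in symptoms for h in backward_trace(trace, s)]
    # Stage 3: the Strategist-Investigator-Arbiter team validates ranked
    # hypotheses with real tool executions; only confirmed patterns are
    # written to the verified episodic memory.
    for hypothesis in team.rank(candidates, memory):
        verdict = team.validate(hypothesis)  # tool-grounded check
        if verdict.confirmed:
            memory.add(hypothesis.pattern, verdict.evidence)
            return Diagnosis(hypothesis.agent, hypothesis.step, verdict.evidence)
    return None  # no hypothesis survived tool-grounded validation
```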
What carries the argument
The verified episodic memory, which stores only error patterns confirmed by tool-grounded execution from the Strategist-Investigator-Arbiter validation team.
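What "updates only when confirmed" could mean as code is worth pinning down, since the self-improvement story rides on it. A minimal sketch, assuming an evidence schema and a naive retrieval heuristic that the paper does not specify:

```python
# Sketch of the "verified" gate: a pattern is written only when backed by
# concrete tool-execution artifacts, never by an unassisted LLM judgment.
# The evidence schema and the retrieval heuristic are assumptions.
class VerifiedEpisodicMemory:
    def __init__(self):
        self._patterns = []

    def add(self, pattern: str, evidence: list) -> None:
        # Refuse updates without executable evidence; this refusal is what
        # separates the memory from ordinary LLM-as-a-judge feedback.
        if not evidence or any(e.get("tool_output") is None for e in evidence):
            raise ValueError("memory updates require tool-grounded evidence")
        self._patterns.append({"pattern": pattern, "evidence": evidence})

    def retrieve(self, symptom: str, k: int = 3) -> list:
        # Naive keyword overlap; the paper does not specify retrieval.
        scored = sorted(self._patterns,
                        key=lambda p: sum(w in p["pattern"] for w in symptom.split()),
                        reverse=True)
        return scored[:k]
```

The gate also makes the failure mode explicit: if the validating agents misread a tool output, bad evidence still passes the check, which is exactly the referee's §3.3 objection below.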
If this is right
- ErrorProbe achieves higher step-level localization accuracy than existing annotation-based or LLM-as-a-judge methods on the TracerTraj and Who&When benchmarks.
- The verified memory supports robust performance on new domains without any retraining or additional labels.
- The framework removes the need for expensive expert annotations while still producing traceable, evidence-backed diagnoses.
- Confirmed error patterns accumulate over time, enabling progressive self-improvement of the diagnostic capability itself.
Where Pith is reading between the lines
- The same tracing-plus-verified-memory pattern could be adapted to single-agent chains or tool-use failures where errors also manifest after many steps.
- Automated diagnosis of this form might be inserted into agent development loops so that each failure directly triggers targeted fixes in the next iteration.
- If the memory remains clean, the approach could scale to very long traces that current context-window methods cannot handle.
Load-bearing premise
The three-agent validation team can reliably confirm or reject error hypotheses using executable tool checks, and the memory will incorporate only genuine patterns without admitting false positives or biases.
What would settle it
On the Who&When benchmark, run ErrorProbe with the verified memory disabled versus enabled, and check whether enabling memory updates lifts step-level localization accuracy and cross-domain transfer above the best baseline, or whether active updates instead degrade performance below it.
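The ablation itself is cheap to specify even though the diagnoser is not. A hedged sketch, with run_errorprobe, the benchmark object, and the scorer as hypothetical placeholders:

```python
# Hedged sketch of the proposed ablation: run the identical diagnoser with
# the verified memory off and on, score both conditions, and compare each
# against the best baseline. run_errorprobe and benchmark are hypothetical.
def ablate_memory(benchmark, run_errorprobe, score_fn):
    results = {}
    for use_memory in (False, True):
        preds = [run_errorprobe(trace, use_memory=use_memory)
                 for trace in benchmark.traces]
        key = "memory_on" if use_memory else "memory_off"
        results[key] = score_fn(preds, benchmark.gold)
    return results  # dict mapping condition to step-level accuracy
```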
Original abstract
Large Language Model (LLM)-based Multi-Agent Systems (MAS) enable complex problem-solving but introduce significant debugging challenges, characterized by long interaction traces, inter-agent dependencies, and delayed error manifestation. Existing diagnostic approaches often rely on expensive expert annotation or "LLM-as-a-judge" paradigms, which struggle to pinpoint decisive error steps within extended contexts. In this paper, we introduce ErrorProbe, a self-improving framework for semantic failure attribution that identifies responsible agents and the originating error step. The framework operates via a three-stage pipeline: (1) operationalizing the MAS failure taxonomy to detect local anomalies, (2) performing symptom-driven backward tracing to prune irrelevant context, and (3) employing a specialized multi-agent team (Strategist, Investigator, Arbiter) to validate error hypotheses through tool-grounded execution. Crucially, ErrorProbe maintains a verified episodic memory that updates only when error patterns are confirmed by executable evidence, without the need for annotation. Experiments across the TracerTraj and Who&When benchmarks demonstrate that ErrorProbe significantly outperforms baselines, particularly in step-level localization, while the verified memory enables robust cross-domain transfer without retraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ErrorProbe, a self-improving framework for semantic failure attribution in LLM-based multi-agent systems. It operates via a three-stage pipeline: operationalizing a MAS failure taxonomy to detect local anomalies, symptom-driven backward tracing to prune context, and a specialized multi-agent team (Strategist, Investigator, Arbiter) to validate error hypotheses through tool-grounded execution. A verified episodic memory updates only on patterns confirmed by executable evidence, enabling self-improvement and robust cross-domain transfer without retraining or annotation. Experiments on the TracerTraj and Who&When benchmarks are claimed to demonstrate significant outperformance over baselines, particularly in step-level localization.
Significance. If the reliability of the verification mechanism and the reported empirical gains hold, the work offers a promising direction for automated debugging of complex MAS by reducing dependence on expert labels and enabling memory-based transfer. The emphasis on tool-grounded confirmation and episodic memory updates is a concrete strength that could support reproducible extensions in the field.
major comments (2)
- §3.3 (Multi-agent validation team): The central self-improving claim depends on the episodic memory updating 'only when error patterns are confirmed by executable evidence' via the Strategist-Investigator-Arbiter team. Because confirmation occurs through LLM-driven hypothesis validation and tool calls rather than an external oracle, the mechanism risks false positives from the same class of LLM judgment errors the system aims to diagnose; no explicit mechanism for detecting or bounding such validation errors is described, directly threatening both the 'verified' property and the no-retraining cross-domain transfer.
- §4 (Experiments and benchmarks): The claims of significant outperformance and cross-domain transfer rest on results from TracerTraj and Who&When, yet the manuscript supplies insufficient detail on baseline implementations, statistical tests, error bars, or how the benchmarks were constructed and how step-level localization was scored. This weakens support for the load-bearing assertion that the verified memory enables robust transfer.
minor comments (2)
- Abstract: the phrase 'significantly outperforms' would benefit from one or two key quantitative metrics to convey the magnitude of improvement immediately.
- §3: Ensure consistent terminology for 'verified episodic memory' and 'tool-grounded execution' across sections to avoid minor ambiguity in the pipeline description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on ErrorProbe. The comments highlight important aspects of the verification process and experimental reporting. We respond point-by-point below, clarifying the role of tool-grounded evidence and committing to expanded details where needed.
Point-by-point responses
- Referee: §3.3 (Multi-agent validation team): The central self-improving claim depends on the episodic memory updating 'only when error patterns are confirmed by executable evidence' via the Strategist-Investigator-Arbiter team. Because confirmation occurs through LLM-driven hypothesis validation and tool calls rather than an external oracle, the mechanism risks false positives from the same class of LLM judgment errors the system aims to diagnose; no explicit mechanism for detecting or bounding such validation errors is described, directly threatening both the 'verified' property and the no-retraining cross-domain transfer.
  Authors: The validation in the Strategist-Investigator-Arbiter team is deliberately anchored in executable tool outputs rather than standalone LLM judgments. The Investigator issues tool calls whose results (e.g., concrete return values, error codes, or state changes) constitute external, verifiable evidence; the Arbiter then evaluates these concrete artifacts against the hypothesis. This design distinguishes the process from pure LLM-as-a-judge approaches. Nevertheless, we acknowledge that LLM interpretation of tool outputs could introduce secondary errors. To address this, we will revise §3.3 to describe an explicit bounding step: the Arbiter requires at least two independent tool executions for confirmation and logs any interpretive discrepancies for later review (see the first sketch after these responses). This addition will reinforce the 'verified' property without altering the core pipeline. Revision: yes.
- Referee: §4 (Experiments and benchmarks): The claims of significant outperformance and cross-domain transfer rest on results from TracerTraj and Who&When, yet the manuscript supplies insufficient detail on baseline implementations, statistical tests, error bars, or how the benchmarks were constructed and how step-level localization was scored. This weakens support for the load-bearing assertion that the verified memory enables robust transfer.
  Authors: We agree that additional experimental transparency is required to substantiate the performance and transfer claims. In the revised manuscript we will expand §4 with: (i) complete baseline implementation details, including prompting strategies and any public code references; (ii) results of statistical significance tests (paired t-tests across runs) together with error bars from five independent trials; (iii) explicit descriptions of benchmark construction, including trace generation procedures and ground-truth step annotations; and (iv) the precise scoring protocol for step-level localization (exact-match accuracy with tolerance for adjacent steps; see the second sketch after these responses). These revisions will directly support the assertion that verified episodic memory enables cross-domain transfer. Revision: yes.
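The bounding step promised in the first response is concrete enough to sketch. The agreement threshold and the discrepancy logging below are one reading of that promise, not the published method:

```python
# Sketch of the confirmation gate from the first response: the Arbiter
# accepts a hypothesis only if n independent tool executions all support
# it, and logs mixed evidence for later review. All names are hypothetical.
import logging

log = logging.getLogger("arbiter")

def arbiter_confirms(hypothesis, run_tool_check, n_required: int = 2) -> bool:
    outcomes = [run_tool_check(hypothesis) for _ in range(n_required)]
    supported = [o.supports_hypothesis for o in outcomes]
    if all(supported):
        return True  # unanimous tool evidence: safe to write to memory
    if any(supported):
        # Interpretive discrepancy: reject the update but keep the record.
        log.warning("discrepant tool evidence for %r: %s", hypothesis, supported)
    return False
```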
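The scoring protocol named in item (iv) of the second response also admits a direct reading; whether adjacent steps earn full credit is an assumption here:

```python
# Sketch of step-level localization scoring: exact match on the responsible
# agent, with a configurable tolerance window on the step index. Crediting
# adjacent steps at full weight is an assumed interpretation.
def localization_accuracy(predictions, gold, step_tolerance: int = 0) -> float:
    hits = 0
    for pred, ref in zip(predictions, gold):
        agent_ok = pred["agent"] == ref["agent"]
        step_ok = abs(pred["step"] - ref["step"]) <= step_tolerance
        hits += agent_ok and step_ok
    return hits / len(gold)
```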
Circularity Check
No circularity: framework and claims are self-contained via external benchmarks and tool execution
Full rationale
The paper introduces ErrorProbe as a three-stage pipeline (taxonomy operationalization, backward tracing, multi-agent hypothesis validation via tool-grounded execution) plus verified episodic memory updated only on confirmed executable evidence. No equations or derivations are present. The central claims rest on benchmark comparisons (TracerTraj, Who&When) and external tool calls rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The memory update rule is defined in terms of independent tool execution, not in terms of the system's own outputs or prior results. This satisfies the criteria for a self-contained empirical framework with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: MAS failures can be localized to specific agents and originating steps through anomaly detection and symptom-driven tracing.
- Domain assumption: Tool-grounded execution by the Strategist-Investigator-Arbiter team can confirm or refute error hypotheses without external annotation.
invented entities (3)
- ErrorProbe framework: no independent evidence
- verified episodic memory: no independent evidence
- Strategist-Investigator-Arbiter multi-agent team: no independent evidence