Recognition: 2 theorem links
Why Do Multi-Agent LLM Systems Fail?
Pith reviewed 2026-05-12 05:38 UTC · model grok-4.3
The pith
Multi-agent LLM systems fail in 14 distinct modes that cluster into design flaws, agent misalignment, and verification gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic analysis of failure traces across multiple frameworks and models shows that multi-agent LLM systems exhibit 14 unique failure modes, which cluster into the three categories of system design issues, inter-agent misalignment, and task verification, with high consistency among annotators.
What carries the argument
The Multi-Agent System Failure Taxonomy (MAST), a structured classification of 14 failure modes into three categories that enables consistent identification and analysis of why these systems underperform.
If this is right
- Performance gaps on benchmarks can be reduced by redesigning systems to avoid the identified failure modes.
- The automated LLM judge enables efficient labeling of new traces while maintaining reliability close to human levels.
- Releasing the full dataset allows other researchers to test and extend the taxonomy on additional models and tasks.
- The patterns indicate that current multi-agent approaches need more advanced coordination and verification mechanisms.
Where Pith is reading between the lines
- The taxonomy could serve as a quick diagnostic checklist when testing new multi-agent frameworks before full deployment.
- Agents might be trained to monitor their own interactions for signs of the three main failure categories and self-correct.
- Similar clustering approaches could apply to failure analysis in other distributed AI systems beyond language models.
Load-bearing premise
The 150 traces examined by experts are representative of the full range of failures that occur across different models, tasks, and multi-agent frameworks.
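Whether the 150 traces are representative hinges on how they were drawn, which is exactly what the referee report below flags. A minimal sketch of one defensible procedure, proportional stratified sampling over a trace attribute such as framework; the attribute names and corpus counts here are illustrative, not from the paper:

```python
import random
from collections import defaultdict

def stratified_sample(traces, key, n_total, seed=0):
    """Proportional stratified sample of n_total traces over strata
    defined by trace[key] (e.g. framework, model, or task)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        strata[t[key]].append(t)
    sample = []
    for _, group in sorted(strata.items()):
        # Allocate slots proportionally to stratum size, at least one each.
        k = max(1, round(n_total * len(group) / len(traces)))
        sample.extend(rng.sample(group, min(k, len(group))))
    if len(sample) < n_total:  # top up after rounding shortfall
        chosen = set(map(id, sample))
        rest = [t for t in traces if id(t) not in chosen]
        sample.extend(rng.sample(rest, n_total - len(sample)))
    return sample[:n_total]

# Illustrative corpus: 1600 traces over 7 hypothetical frameworks,
# sampled down to the paper's 150-trace annotation budget.
traces = [{"id": i, "framework": f"fw{i % 7}"} for i in range(1600)]
subset = stratified_sample(traces, "framework", 150)
```

Documenting a procedure of this shape (or whatever was actually used) is what would let readers evaluate the premise directly.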
What would settle it
A fresh collection of multi-agent failure traces in which most cases cannot be assigned to any of the 14 modes or in which independent expert annotators show low agreement on the categories.
read the original abstract
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MAST-Data, a dataset of over 1600 annotated failure traces from multi-agent LLM systems (MAS) across 7 frameworks, and the MAST taxonomy of 14 failure modes clustered into three categories (system design issues, inter-agent misalignment, task verification). MAST is constructed bottom-up from expert human analysis of 150 traces (κ=0.88), scaled via an LLM-as-a-Judge pipeline, and used to examine failure patterns across models (GPT-4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), with public release of the dataset, taxonomy, and annotator to guide future MAS design.
Significance. If the taxonomy is representative and the annotations reliable, the work provides a valuable public resource for systematically understanding MAS failures, which is timely given the minimal benchmark gains often observed. The bottom-up construction with reported agreement metrics, cross-model/task analysis, and open release of data/tools could directly support more robust system design and serve as a foundation for subsequent research.
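The reported κ = 0.88 is Cohen's kappa: observed agreement corrected for the agreement two annotators would reach by chance. A minimal sketch of the computation; the example labels are illustrative, not the paper's annotations:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement p_o,
    corrected for chance agreement p_e derived from each rater's
    marginal label frequencies."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    cats = set(labels_a) | set(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in cats) / n**2
    return (p_o - p_e) / (1 - p_e)

# Illustrative annotations over the paper's three categories (not real data).
rater_1 = ["design", "misalign", "verify", "design", "verify", "misalign"]
rater_2 = ["design", "misalign", "verify", "misalign", "verify", "misalign"]
kappa = cohens_kappa(rater_1, rater_2)  # 0.75 on this toy example
```

A κ of 0.88 on a three-category task indicates near-perfect agreement under conventional benchmarks, which is why the report treats the initial 150-trace annotation as reliable.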
major comments (2)
- [Taxonomy construction (Section 3 / Methods, as described in abstract)] The construction of MAST relies exclusively on analysis of 150 traces, yet the manuscript provides no description of the sampling procedure (e.g., random, stratified by framework/model/task, or convenience sampling). This is load-bearing for the central claim that the 14 modes and three-category clustering capture dominant failure dynamics, as non-representative selection could omit high-impact modes or over-represent others, undermining both the taxonomy's validity and the utility of MAST-Data for guiding future systems.
- [LLM-as-a-Judge pipeline and dataset annotation (Section 4)] While high inter-annotator agreement (κ=0.88) is reported for the initial 150 traces, the manuscript does not provide quantitative validation metrics (e.g., per-category agreement, confusion matrices, or error rates) for the LLM-as-a-Judge pipeline on the remaining 1450+ traces. This weakens the reliability of the full dataset's failure distributions and cross-model/task analyses.
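The per-category metrics requested above are straightforward to compute once human and LLM-judge labels are paired per trace. A sketch, with the three MAST category names abbreviated and the labels purely illustrative:

```python
def confusion_matrix(human, judge, categories):
    """m[h][j] counts traces the human labeled h and the judge labeled j."""
    m = {h: {j: 0 for j in categories} for h in categories}
    for h, j in zip(human, judge):
        m[h][j] += 1
    return m

def per_category_recall(m):
    """For each category: fraction of human-labeled traces the judge recovers."""
    out = {}
    for c, row in m.items():
        total = sum(row.values())
        out[c] = row[c] / total if total else 0.0
    return out

cats = ["design", "misalign", "verify"]  # MAST's three categories, abbreviated
human = ["design", "design", "misalign", "verify", "verify", "misalign"]
judge = ["design", "misalign", "misalign", "verify", "verify", "misalign"]
m = confusion_matrix(human, judge, cats)
recall = per_category_recall(m)  # here the judge misses one design trace
```

Reporting this breakdown on a held-out human-labeled subset would show whether the judge's errors concentrate in particular categories, which an aggregate agreement number hides.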
minor comments (2)
- [Introduction / Related Work] The abstract claims MAST-Data is 'the first' such dataset, but the manuscript should include a brief related-work comparison table to substantiate this novelty claim against prior MAS failure analyses.
- [Dataset description (Section 2)] Exact counts and breakdowns (by framework, model, task) for the 1600+ traces and the 14 modes should be reported in a table for transparency, rather than relying on '1600+' and '14 unique modes' phrasing.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify gaps in the current description, we will revise the manuscript to incorporate the requested details and strengthen the presentation of our methods and validation.
read point-by-point responses
-
Referee: [Taxonomy construction (Section 3 / Methods, as described in abstract)] The construction of MAST relies exclusively on analysis of 150 traces, yet the manuscript provides no description of the sampling procedure (e.g., random, stratified by framework/model/task, or convenience sampling). This is load-bearing for the central claim that the 14 modes and three-category clustering capture dominant failure dynamics, as non-representative selection could omit high-impact modes or over-represent others, undermining both the taxonomy's validity and the utility of MAST-Data for guiding future systems.
Authors: We agree that an explicit description of the sampling procedure for the 150 expert-annotated traces is necessary to support claims about the taxonomy's coverage. The traces were drawn from the full collection to span the seven frameworks, multiple models, and task categories, prioritizing diversity in observed failure behaviors during initial data exploration. We will revise Section 3 to document the exact selection process, including any stratification or inclusion criteria used, so readers can evaluate representativeness directly. revision: yes
-
Referee: [LLM-as-a-Judge pipeline and dataset annotation (Section 4)] While high inter-annotator agreement (κ=0.88) is reported for the initial 150 traces, the manuscript does not provide quantitative validation metrics (e.g., per-category agreement, confusion matrices, or error rates) for the LLM-as-a-Judge pipeline on the remaining 1450+ traces. This weakens the reliability of the full dataset's failure distributions and cross-model/task analyses.
Authors: We appreciate this observation regarding the need for more granular validation of the LLM-as-a-Judge pipeline. The manuscript notes high agreement with human annotations but does not report per-category metrics, confusion matrices, or error rates on the scaled portion of the data. We will add these quantitative validation results in a revised Section 4, including agreement statistics on a held-out validation subset and any error analysis, to better substantiate the reliability of the full MAST-Data distributions and downstream analyses. revision: yes
Circularity Check
No circularity: taxonomy derived bottom-up from annotated traces
full rationale
The paper builds the MAST taxonomy inductively via human analysis of 150 traces, followed by inter-annotator validation (κ = 0.88) and an LLM-as-a-Judge pipeline calibrated to those annotations. No equations, fitted parameters, self-referential definitions, or self-citation chains reduce any claim to its own inputs by construction. The 14-mode classification and three-category clustering are presented as empirical outputs of the trace analysis rather than predictions or first-principles results equivalent to the input data. The absence of a described sampling procedure for the 150 traces affects representativeness but does not constitute circularity under the defined criteria.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 150 traces analyzed by experts are representative of failure patterns across the 7 MAS frameworks and chosen tasks.
- domain assumption: Expert human annotators can consistently identify and categorize failure modes with high reliability.
invented entities (2)
- Multi-Agent System Failure Taxonomy (MAST) with 14 modes in 3 categories (no independent evidence)
- LLM-as-a-Judge annotation pipeline (no independent evidence)
Lean theorems connected to this paper
- LawOfExistence · defect_zero_iff_one (unclear): "We develop MAST through rigorous analysis of 150 traces... identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification."
- HierarchyEmergence · hierarchy_emergence_forces_phi (unclear): "MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems."
Forward citations
Cited by 38 Pith papers
-
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
-
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-ti...
-
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.
-
Learning to Interrupt in Language-based Multi-agent Communication
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
-
How to Interpret Agent Behavior
ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
-
Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost...
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
-
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
-
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
-
Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents
LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
-
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...
-
TRUST: A Framework for Decentralized AI Service v.0.1
TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...
-
Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...
-
Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.
-
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
-
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
-
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
-
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.
-
Agentic Microphysics: A Manifesto for Generative AI Safety
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
-
Qualixar OS: A Universal Operating System for AI Agent Orchestration
Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...
-
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
Reference graph
Works this paper leans on
- [1]
-
[2]
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis, 2023. URLhttps://arxiv.org/abs/2305.15334
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. URL https: //arxiv.org/abs/2310.08560
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), March
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), March
-
[5]
A survey on large language model based autonomous agents,
ISSN 2095-2236. doi: 10.1007/s11704-024-40231-1. URL http://dx.doi.org/10. 1007/s11704-024-40231-1
-
[6]
ChatDev: Communicative Agents for Software Development
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development.arXiv preprint arXiv:2307.07924, 2023. URL https://arxiv.org/abs/2307.07924
work page internal anchor Pith review arXiv 2023
-
[7]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. The virtual lab: Ai agents design new sars-cov-2 nanobodies with experimental validation.bioRxiv, 2024. doi: 10. 1101/2024.11.11.623004. URL https://www.biorxiv.org/content/early/2024/11/ 12/2024.11.11.623004
work page 2024
-
[10]
Generative Agents: Interactive Simulacra of Human Behavior
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Openmanus: An open- source framework for building general ai agents
Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, and Sirui Hong. Openmanus: An open- source framework for building general ai agents. https://github.com/mannaandpoem/ OpenManus, 2025
work page 2025
-
[12]
Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist multi-agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024
-
[13]
Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Vision and the road ahead, 2024. URL https://arxiv.org/abs/2404.04834
-
[14]
Roco: Dialectic multi-robot collaboration with large language models
Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models, 2023. URLhttps://arxiv.org/abs/2307.04738
-
[15]
arXiv preprint arXiv:2307.02485
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models, 2024. URLhttps://arxiv.org/abs/2307.02485. 11
-
[16]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023. URL https: //arxiv.org/abs/2305.14325
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023
work page 2023
-
[18]
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024
work page internal anchor Pith review arXiv 2024
-
[19]
Agentless: Demystifying LLM-based Software Engineering Agents
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter, 2024. URLhttps://arxiv.org/abs/2407.01502
-
[21]
Barney G. Glaser and Anselm L. Strauss.The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine Publishing Company, 1967
work page 1967
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory,
- [24]
-
[25]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URLhttps://arxiv.org/abs/2310.03714
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Stateflow: Enhancing llm task-solving through state-driven workflows, 2024
Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. Stateflow: Enhancing llm task-solving through state-driven workflows, 2024. URLhttps://arxiv.org/abs/2403. 11322
work page 2024
-
[27]
Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, and Chaoyang He. Llm multi-agent systems: Challenges and open problems, 2024. URL https://arxiv.org/abs/ 2402.03578
-
[28]
Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tom ´aˇs Gaven ˇciak, The Anh Han, Edward Hughes, V ojtˇech Kovaˇr´ık, Jan Kulveit, Joel Z. Leibo, Caspar Oesterheld, Chris- tian Schroeder de Witt, Nisarg Shah, Michael Wellman, Paolo Bova, Theodor Cimpeanu, Carson...
-
[29]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=VTF8yNQM66
work page 2024
-
[30]
A survey of useful llm evaluation.arXiv preprint arXiv:2406.00936, 2024
Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, and Yun- Nung Chen. A survey of useful llm evaluation.arXiv preprint arXiv:2406.00936, 2024. 12
-
[31]
Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems.arXiv preprint arXiv:2408.15971, 2024
-
[32]
Timoth´ee Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, and Sebastian Risi. Harnessing language for coordination: A framework and benchmark for llm-driven multi-agent control.arXiv preprint arXiv:2412.11761, 2024
-
[33]
Matteo Bettini, Amanda Prorok, and Vincent Moens. Benchmarl: Benchmarking multi-agent reinforcement learning.Journal of Machine Learning Research, 25(217):1–10, 2024
work page 2024
-
[34]
Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, and Xiaofeng Gao. Teamcraft: A benchmark for multi-modal multi-agent systems in minecraft.arXiv preprint arXiv:2412.05255, 2024
-
[35]
Lu, S., Wang, Y ., Sheng, L., He, L., Zheng, A., and Liang, J
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: A survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023
-
[36]
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly.High-Confidence Computing, page 100211, 2024
work page 2024
-
[37]
URL https://www.anthropic.com/research/ building-effective-agents
Anthropic, Dec 2024. URL https://www.anthropic.com/research/ building-effective-agents
work page 2024
-
[38]
Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Hao Zhang, Anastasios Angelopou- los, Shishir G Patil, Lingjiao Chen, Wei-Lin Chiang, and Jared Q Davis. Specifications: The missing link to making the development of llm systems an engineering discipline.arXiv preprint arXiv:2412.05299, 2024
-
[39]
Gagan Bansal, Jennifer Wortman Vaughan, Saleema Amershi, Eric Horvitz, Adam Fourney, Hussein Mozannar, Victor Dibia, and Daniel S. Weld. Challenges in human-agent communication. Technical Report MSR-TR-2024-53, Microsoft, December 2024. URL https://www.microsoft.com/en-us/research/publication/human-agent-interaction-challenges/.
[40]
Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 74...
[41]
Song Da, Zijie Zhou, Zhijie Wang, Yuheng Huang, Shengmai Chen, Bonan Kou, Lei Ma, and Tianyi Zhang. An empirical study of code generation errors made by large language models. In 7th Annual Symposium on Machine Programming, 2023.
[42]
[43]
Will Epperson, Gagan Bansal, Victor Dibia, Adam Fourney, Jack Gerrits, Erkang (Eric) Zhu, and Saleema Amershi. Interactive debugging and steering of multi-agent ai systems. In CHI 2025, April 2025. URL https://arxiv.org/abs/2503.02068.
[44]
Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? On automated failure attribution of llm multi-agent systems, 2025. URL https://arxiv.org/abs/2505.00212.
[45]
Claire B Draucker, Donna S Martsolf, Ratchneewan Ross, and Thomas B Rusk. Theoretical sampling and category development in grounded theory. Qualitative Health Research, 17(8):1137–1148, 2007.
[46]
Shahedul Huq Khandkar. Open coding. University of Calgary, 23(2009):2009, 2009.
[47]
[48]
Anthropic. Model context protocol: Introduction. https://modelcontextprotocol.io/introduction, December 2024.
[49]
Rao Surapaneni, Miku Jha, Michael Vakoc, and Todd Segal. A2a: A new era of agent interoperability, April 2025. URL https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/. Google Developers Blog.
[50]
Saaket Agashe, Yue Fan, Anthony Reyna, and Xin Eric Wang. Llm-coordination: Evaluating and analyzing multi-agent coordination abilities in large language models, 2025. URL https://arxiv.org/abs/2310.03903.
[51]
Charles Perrow. Normal Accidents: Living with High-Risk Technologies. Princeton University Press, Princeton, NJ, 1984. ISBN 978-0691004129.
[52]
Karlene H. Roberts. New challenges in organizational research: High reliability organizations. Organization & Environment, 3(2):111–125, 1989. doi: 10.1177/108602668900300202.
[53]
Gene I Rochlin. Reliable organizations: Present research and future directions. Journal of Contingencies and Crisis Management, 4(2), 1996. ISSN 0966-0879.
[54]
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[55]
Huy Nhat Phan, Tien N Nguyen, Phong X Nguyen, and Nghi DQ Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale. arXiv preprint arXiv:2409.16299, 2024.
[56]
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901, 2024.
[57]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
[58]
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, 2024.
[59]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[60]
Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, and Sadid Hasan. Does prompt formatting have any impact on llm performance? arXiv preprint arXiv:2411.10541, 2024.
[61]
Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023.
[62]
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
[63]
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[64]
[65]
Anthropic. Building effective agents, 2024. URL https://www.anthropic.com/research/building-effective-agents.
[66]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
[67]
Fatemeh Haji, Mazal Bethany, Maryam Tabar, Jason Chiang, Anthony Rios, and Peyman Najafirad. Improving llm reasoning with multi-agent tree-of-thought validator agent. arXiv preprint arXiv:2409.11527, 2024.
[68]
Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, and Yuxiang Wu. Towards reasoning in large language models via multi-agent peer review collaboration. arXiv preprint arXiv:2311.08152, 2023.
[69]
Benedikt Stroebl, Sayash Kapoor, and Arvind Narayanan. Inference scaling fLaws: The limits of llm resampling with imperfect verifiers. arXiv preprint arXiv:2411.17501, 2024.
[70]
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. Are more llm calls all you need? Towards scaling laws of compound inference systems. arXiv preprint arXiv:2403.02419, 2024.
[71]
Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752, 2024.
[72]
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
[73]
Pavan Kapanipathi, Ibrahim Abdelaziz, Srinivas Ravishankar, Salim Roukos, Alexander Gray, Ramon Astudillo, Maria Chang, Cristina Cornelio, Saswati Dana, Achille Fokoue, et al. Question answering over knowledge bases by leveraging semantic parsing and neuro-symbolic reasoning. arXiv preprint arXiv:2012.01707, 2020.
[74]
Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: Workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024.
[75]
Yaru Niu, Rohan R Paleja, and Matthew C Gombolay. Multi-agent graph-attention communication and teaming. In AAMAS, 2021.
[76]
Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. Advances in Neural Information Processing Systems, 31, 2018.
[77]
Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. arXiv preprint arXiv:1812.09755, 2018.
[78]
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022.
[79]
Xudong Guo, Daming Shi, Junjie Yu, and Wenhui Fan. Heterogeneous multi-agent reinforcement learning for zero-shot scalable collaboration. arXiv preprint arXiv:2404.03869, 2024.
[80]
Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system. arXiv preprint arXiv:2410.08115, 2024.