pith · machine review for the scientific record

arxiv: 2503.13657 · v3 · submitted 2025-03-17 · 💻 cs.AI

Recognition: 2 theorem links

Why Do Multi-Agent LLM Systems Fail?

Aditya Parameswaran, Bhavya Chopra, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Kannan Ramchandran, Kurt Keutzer, Lakshya A. Agrawal, Matei Zaharia, Melissa Z. Pan, Mert Cemri, Rishabh Tiwari, Shuyi Yang

Pith reviewed 2026-05-12 05:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent systems · LLM failures · failure taxonomy · agent misalignment · system design · task verification · LLM evaluation

The pith

Multi-agent LLM systems fail in 14 distinct modes that cluster into design flaws, agent misalignment, and verification gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects over 1600 annotated failure traces from multi-agent LLM systems across seven popular frameworks. By examining 150 of these traces with expert annotators, it builds a taxonomy that sorts failures into 14 modes. These modes fall into three groups: issues with overall system design, cases where agents fail to coordinate properly, and problems in confirming that the task is complete. The work also creates an automated labeling method using LLMs that matches human judgments closely. The dataset, taxonomy, and labeling tool are released to support targeted improvements in future multi-agent setups.

Core claim

Systematic analysis of failure traces across multiple frameworks and models shows that multi-agent LLM systems exhibit 14 distinct failure modes. These cluster into three categories, system design issues, inter-agent misalignment, and task verification, and expert annotators identify them with high consistency.

What carries the argument

The Multi-Agent System Failure Taxonomy (MAST), a structured classification of 14 failure modes into three categories that enables consistent identification and analysis of why these systems underperform.
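A taxonomy like MAST is, structurally, a fixed set of modes grouped under fixed categories. The sketch below is hypothetical: the paper's 14 mode names are not reproduced here, so the mode identifier is a placeholder, and only the three reported categories are taken from the source.

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    # The three MAST categories reported in the paper.
    SYSTEM_DESIGN = "system design issues"
    INTER_AGENT_MISALIGNMENT = "inter-agent misalignment"
    TASK_VERIFICATION = "task verification"

@dataclass(frozen=True)
class FailureMode:
    # One of the 14 modes; the id here is an illustrative placeholder,
    # not one of the paper's actual mode labels.
    mode_id: str
    category: Category

@dataclass
class TraceAnnotation:
    trace_id: str
    modes: list[FailureMode]  # a single trace can exhibit several modes

# Label one trace with a placeholder mode from one category.
example = TraceAnnotation(
    trace_id="trace-0001",
    modes=[FailureMode("FM-1.x", Category.SYSTEM_DESIGN)],
)
print(example.modes[0].category.value)  # -> system design issues
```

Fixing the categories in an `Enum` means an annotation can never carry an out-of-taxonomy label, which is what makes downstream failure counts comparable across frameworks.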

If this is right

  • Performance gaps on benchmarks can be reduced by redesigning systems to avoid the identified failure modes.
  • The automated LLM judge enables efficient labeling of new traces while maintaining reliability close to human levels.
  • Releasing the full dataset allows other researchers to test and extend the taxonomy on additional models and tasks.
  • The patterns indicate that current multi-agent approaches need more advanced coordination and verification mechanisms.
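The automated labeling step mentioned above can be sketched as a constrained label-and-validate loop. Everything in this sketch is assumed rather than taken from the paper's released annotator: the prompt wording, the stub `call_llm`, and the use of category-level (rather than mode-level) labels.

```python
CATEGORIES = ["system design issues", "inter-agent misalignment", "task verification"]

PROMPT_TEMPLATE = (
    "You are auditing a multi-agent LLM trace.\n"
    "Assign exactly one failure category from: {cats}.\n"
    "Trace:\n{trace}\n"
    "Answer with the category name only."
)

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call; a deployed pipeline
    # would send `prompt` to an LLM API instead.
    return "task verification"

def judge_trace(trace: str) -> str:
    raw = call_llm(PROMPT_TEMPLATE.format(cats=", ".join(CATEGORIES), trace=trace))
    label = raw.strip().lower()
    # Reject anything outside the fixed taxonomy so downstream
    # failure statistics stay well-defined.
    if label not in CATEGORIES:
        raise ValueError(f"judge returned out-of-taxonomy label: {raw!r}")
    return label

print(judge_trace("agent A declared success without running the tests"))
# -> task verification
```

The validation step matters: without it, free-form judge output would silently widen the taxonomy and break comparability with the human annotations.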

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The taxonomy could serve as a quick diagnostic checklist when testing new multi-agent frameworks before full deployment.
  • Agents might be trained to monitor their own interactions for signs of the three main failure categories and self-correct.
  • Similar clustering approaches could apply to failure analysis in other distributed AI systems beyond language models.

Load-bearing premise

The 150 traces examined by experts are representative of the full range of failures that occur across different models, tasks, and multi-agent frameworks.

What would settle it

A fresh collection of multi-agent failure traces in which most cases cannot be assigned to any of the 14 modes or in which independent expert annotators show low agreement on the categories.

Original abstract

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MAST-Data, a dataset of over 1600 annotated failure traces from multi-agent LLM systems (MAS) across 7 frameworks, and the MAST taxonomy of 14 failure modes clustered into three categories (system design issues, inter-agent misalignment, task verification). MAST is constructed bottom-up from expert human analysis of 150 traces (κ=0.88), scaled via an LLM-as-a-Judge pipeline, and used to examine failure patterns across models (GPT-4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), with public release of the dataset, taxonomy, and annotator to guide future MAS design.

Significance. If the taxonomy is representative and the annotations reliable, the work provides a valuable public resource for systematically understanding MAS failures, which is timely given the minimal benchmark gains often observed. The bottom-up construction with reported agreement metrics, cross-model/task analysis, and open release of data/tools could directly support more robust system design and serve as a foundation for subsequent research.

major comments (2)
  1. [Taxonomy construction (Section 3 / Methods, as described in abstract)] The construction of MAST relies exclusively on analysis of 150 traces, yet the manuscript provides no description of the sampling procedure (e.g., random, stratified by framework/model/task, or convenience sampling). This is load-bearing for the central claim that the 14 modes and three-category clustering capture dominant failure dynamics, as non-representative selection could omit high-impact modes or over-represent others, undermining both the taxonomy's validity and the utility of MAST-Data for guiding future systems.
  2. [LLM-as-a-Judge pipeline and dataset annotation (Section 4)] While high inter-annotator agreement (κ=0.88) is reported for the initial 150 traces, the manuscript does not provide quantitative validation metrics (e.g., per-category agreement, confusion matrices, or error rates) for the LLM-as-a-Judge pipeline on the remaining 1450+ traces. This weakens the reliability of the full dataset's failure distributions and cross-model/task analyses.
minor comments (2)
  1. [Introduction / Related Work] The abstract claims MAST-Data is 'the first' such dataset but the manuscript should include a brief related-work comparison table to substantiate novelty claims regarding prior MAS failure analyses.
  2. [Dataset description (Section 2)] Exact counts and breakdowns (by framework, model, task) for the 1600+ traces and the 14 modes should be reported in a table for transparency, rather than relying on '1600+' and '14 unique modes' phrasing.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify gaps in the current description, we will revise the manuscript to incorporate the requested details and strengthen the presentation of our methods and validation.

Point-by-point responses
  1. Referee: [Taxonomy construction (Section 3 / Methods, as described in abstract)] The construction of MAST relies exclusively on analysis of 150 traces, yet the manuscript provides no description of the sampling procedure (e.g., random, stratified by framework/model/task, or convenience sampling). This is load-bearing for the central claim that the 14 modes and three-category clustering capture dominant failure dynamics, as non-representative selection could omit high-impact modes or over-represent others, undermining both the taxonomy's validity and the utility of MAST-Data for guiding future systems.

    Authors: We agree that an explicit description of the sampling procedure for the 150 expert-annotated traces is necessary to support claims about the taxonomy's coverage. The traces were drawn from the full collection to span the seven frameworks, multiple models, and task categories, prioritizing diversity in observed failure behaviors during initial data exploration. We will revise Section 3 to document the exact selection process, including any stratification or inclusion criteria used, so readers can evaluate representativeness directly. revision: yes

  2. Referee: [LLM-as-a-Judge pipeline and dataset annotation (Section 4)] While high inter-annotator agreement (κ=0.88) is reported for the initial 150 traces, the manuscript does not provide quantitative validation metrics (e.g., per-category agreement, confusion matrices, or error rates) for the LLM-as-a-Judge pipeline on the remaining 1450+ traces. This weakens the reliability of the full dataset's failure distributions and cross-model/task analyses.

    Authors: We appreciate this observation regarding the need for more granular validation of the LLM-as-a-Judge pipeline. The manuscript notes high agreement with human annotations but does not report per-category metrics, confusion matrices, or error rates on the scaled portion of the data. We will add these quantitative validation results in a revised Section 4, including agreement statistics on a held-out validation subset and any error analysis, to better substantiate the reliability of the full MAST-Data distributions and downstream analyses. revision: yes
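The per-category validation the referee requests reduces to a confusion matrix over paired human and judge labels, from which per-category agreement falls out directly. A minimal pure-Python sketch, on synthetic labels rather than the paper's data:

```python
from collections import Counter

def confusion_matrix(human, judge, labels):
    # counts[(h, j)] = traces humans labeled h that the judge labeled j
    counts = Counter(zip(human, judge))
    return {h: {j: counts[(h, j)] for j in labels} for h in labels}

def per_category_recall(cm):
    # Fraction of each human-labeled category the judge recovered.
    recall = {}
    for h, row in cm.items():
        total = sum(row.values())
        recall[h] = row[h] / total if total else 0.0
    return recall

labels = ["design", "misalignment", "verification"]  # illustrative names
human = ["design", "design", "misalignment", "verification", "verification"]
judge = ["design", "misalignment", "misalignment", "verification", "design"]

cm = confusion_matrix(human, judge, labels)
print(per_category_recall(cm))
# -> {'design': 0.5, 'misalignment': 1.0, 'verification': 0.5}
```

Reporting the full matrix rather than a single agreement number exposes exactly the asymmetries the referee is worried about, such as a judge that recovers one category reliably while confusing the other two.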

Circularity Check

0 steps flagged

No circularity: taxonomy derived bottom-up from annotated traces

Full rationale

The paper builds the MAST taxonomy inductively via human analysis of 150 traces, followed by inter-annotator validation (kappa=0.88) and an LLM-as-Judge pipeline calibrated to those annotations. No equations, fitted parameters, self-referential definitions, or self-citation chains reduce any claim to its own inputs by construction. The 14-mode classification and three-category clustering are presented as empirical outputs from the trace analysis rather than predictions or first-principles results equivalent to the input data. The absence of a described sampling procedure for the 150 traces affects representativeness but does not constitute circularity under the defined criteria.
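The κ = 0.88 cited above is Cohen's kappa, which discounts the agreement two annotators would reach by chance given their label frequencies. A minimal computation on toy labels (not the paper's annotations; category names are illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    # Observed agreement: fraction of items both annotators label the same.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

ann1 = ["design", "design", "verify", "misalign", "verify", "design"]
ann2 = ["design", "design", "verify", "misalign", "design", "design"]
print(round(cohens_kappa(ann1, ann2), 3))  # -> 0.714
```

Because kappa subtracts chance agreement, a value of 0.88 on a 14-way labeling task is considerably stronger evidence of a stable taxonomy than 88% raw agreement would be.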

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on the representativeness of the collected traces and the reliability of expert labeling; no free parameters are fitted to data, and no new physical or mathematical entities are postulated.

axioms (2)
  • domain assumption The 150 traces analyzed by experts are representative of failure patterns across the 7 MAS frameworks and chosen tasks
    This assumption underpins generalization from the initial analysis to the full 1600+ trace dataset and the resulting taxonomy.
  • domain assumption Expert human annotators can consistently identify and categorize failure modes with high reliability
    Invoked to justify the kappa=0.88 agreement and the use of the taxonomy as a stable reference.
invented entities (2)
  • Multi-Agent System Failure Taxonomy (MAST) with 14 modes in 3 categories no independent evidence
    purpose: To systematically classify observed failures for analysis and future system design
    Newly constructed from the 150-trace analysis; no independent falsifiable prediction is provided.
  • LLM-as-a-Judge annotation pipeline no independent evidence
    purpose: To scale labeling of the remaining traces while matching human judgments
    Developed as part of the work; agreement with humans is asserted but not quantified in the abstract.

pith-pipeline@v0.9.0 · 5633 in / 1537 out tokens · 37064 ms · 2026-05-12T05:38:36.140360+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one unclear

    We develop MAST through rigorous analysis of 150 traces... identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification.

  • HierarchyEmergence hierarchy_emergence_forces_phi unclear

    MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems.

Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

    cs.CR 2026-04 unverdicted novelty 8.0

    The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

  2. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  3. AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

  4. TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

    cs.AI 2026-05 conditional novelty 7.0

    TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.

  5. TeamBench: Evaluating Agent Coordination under Enforced Role Separation

    cs.AI 2026-05 unverdicted novelty 7.0

    Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

  6. Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs

    cs.MA 2026-05 unverdicted novelty 7.0

    LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.

  7. Inference-Time Budget Control for LLM Search Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.

  8. Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-ti...

  9. AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking

    cs.SE 2026-04 conditional novelty 7.0

    AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.

  10. Learning to Interrupt in Language-based Multi-agent Communication

    cs.CL 2026-04 unverdicted novelty 7.0

    HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...

  11. How to Interpret Agent Behavior

    cs.AI 2026-05 conditional novelty 6.0

    ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.

  12. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

    SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

  13. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  14. PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

    cs.AI 2026-05 unverdicted novelty 6.0

    PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

  15. Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems

    cs.MA 2026-05 unverdicted novelty 6.0

    Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost...

  16. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  17. EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

    cs.CL 2026-04 unverdicted novelty 6.0

    EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...

  18. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  19. Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

    cs.HC 2026-04 unverdicted novelty 6.0

    A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.

  20. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

    cs.AI 2026-04 conditional novelty 6.0

    Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

  21. Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.

  22. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  23. AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

    cs.AI 2026-05 unverdicted novelty 5.0

    Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

  24. Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents

    physics.soc-ph 2026-05 unverdicted novelty 5.0

    LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.

  25. Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...

  26. TRUST: A Framework for Decentralized AI Service v.0.1

    cs.AI 2026-04 unverdicted novelty 5.0

    TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...

  27. Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems

    cs.CR 2026-04 unverdicted novelty 5.0

    Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...

  28. Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems

    cs.MA 2026-04 unverdicted novelty 5.0

    MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.

  29. More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

    cs.SE 2026-04 unverdicted novelty 5.0

    AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.

  30. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

    cs.SE 2026-04 unverdicted novelty 5.0

    Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...

  31. Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

    cs.AI 2026-04 conditional novelty 5.0

    A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.

  32. Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

    cs.SE 2026-05 conditional novelty 4.0

    Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.

  33. Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

    cs.MA 2026-05 unverdicted novelty 4.0

    Agentic AI requires social theory as a structural prior in the proposed MASS framework to model emergent outcomes from agent interactions and influence.

  34. Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

    cs.MA 2026-05 unverdicted novelty 4.0

    Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.

  35. Agentic Microphysics: A Manifesto for Generative AI Safety

    cs.CY 2026-04 unverdicted novelty 4.0

    The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.

  36. Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout

    cs.CR 2026-04 unverdicted novelty 4.0

    FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.

  37. Qualixar OS: A Universal Operating System for AI Agent Orchestration

    cs.AI 2026-04 unverdicted novelty 4.0

    Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...

  38. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

    cs.SE 2026-04 unverdicted novelty 4.0

    Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

Reference graph

Works this paper leans on

123 extracted references · 123 canonical work pages · cited by 36 Pith papers · 17 internal anchors

  1. [1]

    The Russian Messenger, 1878

    Leo Tolstoy.Anna Karenina. The Russian Messenger, 1878

  2. [2]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis, 2023. URLhttps://arxiv.org/abs/2305.15334

  3. [3]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. URL https: //arxiv.org/abs/2310.08560

  4. [4]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), March

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), March

  5. [5]

    A survey on large language model based autonomous agents,

    ISSN 2095-2236. doi: 10.1007/s11704-024-40231-1. URL http://dx.doi.org/10. 1007/s11704-024-40231-1

  6. [6]

    ChatDev: Communicative Agents for Software Development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development.arXiv preprint arXiv:2307.07924, 2023. URL https://arxiv.org/abs/2307.07924

  7. [7]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  8. [8]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

  9. [9]

    Bulaong, John E

    Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. The virtual lab: Ai agents design new sars-cov-2 nanobodies with experimental validation.bioRxiv, 2024. doi: 10. 1101/2024.11.11.623004. URL https://www.biorxiv.org/content/early/2024/11/ 12/2024.11.11.623004

  10. [10]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442

  11. [11]

    Openmanus: An open- source framework for building general ai agents

    Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, and Sirui Hong. Openmanus: An open- source framework for building general ai agents. https://github.com/mannaandpoem/ OpenManus, 2025

  12. [12]

    Magentic-one: A generalist multi- agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist multi-agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024

  13. [13]

    LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead, 2024, [arXiv:cs.SE/2404.04834]

    Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Vision and the road ahead, 2024. URL https://arxiv.org/abs/2404.04834

  14. [14]

    Roco: Dialectic multi-robot collaboration with large language models

    Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models, 2023. URLhttps://arxiv.org/abs/2307.04738

  15. [15]

    arXiv preprint arXiv:2307.02485

    Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models, 2024. URLhttps://arxiv.org/abs/2307.02485. 11

  16. [16]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023. URL https: //arxiv.org/abs/2305.14325

  17. [17]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  18. [18]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024

  19. [19]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489

  20. [20]

    Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter, 2024. URL https://arxiv.org/abs/2407.01502

  21. [21]

    Barney G. Glaser and Anselm L. Strauss. The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine Publishing Company, 1967

  22. [22]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685

  23. [23]

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory, 2024. URL https://arxiv.org/abs/2409.07429

  25. [25]

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714

  26. [26]

    Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. Stateflow: Enhancing llm task-solving through state-driven workflows, 2024. URL https://arxiv.org/abs/2403.11322

  27. [27]

    Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, and Chaoyang He. Llm multi-agent systems: Challenges and open problems, 2024. URL https://arxiv.org/abs/2402.03578

  28. [28]

    Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tomáš Gavenčiak, The Anh Han, Edward Hughes, Vojtěch Kovařík, Jan Kulveit, Joel Z. Leibo, Caspar Oesterheld, Christian Schroeder de Witt, Nisarg Shah, Michael Wellman, Paolo Bova, Theodor Cimpeanu, Carson...

  29. [29]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  30. [30]

    Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, and Yun-Nung Chen. A survey of useful llm evaluation. arXiv preprint arXiv:2406.00936, 2024

  31. [31]

    Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971, 2024

  32. [32]

    Timothée Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, and Sebastian Risi. Harnessing language for coordination: A framework and benchmark for llm-driven multi-agent control. arXiv preprint arXiv:2412.11761, 2024

  33. [33]

    Matteo Bettini, Amanda Prorok, and Vincent Moens. Benchmarl: Benchmarking multi-agent reinforcement learning. Journal of Machine Learning Research, 25(217):1–10, 2024

  34. [34]

    Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, and Xiaofeng Gao. Teamcraft: A benchmark for multi-modal multi-agent systems in minecraft. arXiv preprint arXiv:2412.05255, 2024

  35. [35]

    Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: A survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023

  36. [36]

    Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, page 100211, 2024

  37. [37]

    Anthropic. Building effective agents, Dec 2024. URL https://www.anthropic.com/research/building-effective-agents

  38. [38]

    Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Hao Zhang, Anastasios Angelopoulos, Shishir G Patil, Lingjiao Chen, Wei-Lin Chiang, and Jared Q Davis. Specifications: The missing link to making the development of llm systems an engineering discipline. arXiv preprint arXiv:2412.05299, 2024

  39. [39]

    Gagan Bansal, Jennifer Wortman Vaughan, Saleema Amershi, Eric Horvitz, Adam Fourney, Hussein Mozannar, Victor Dibia, and Daniel S. Weld. Challenges in human-agent communication. Technical Report MSR-TR-2024-53, Microsoft, December 2024. URL https://www.microsoft.com/en-us/research/publication/human-agent-interaction-challenges/

  40. [40]

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 74...

  41. [41]

    Song Da, Zijie Zhou, Zhijie Wang, Yuheng Huang, Shengmai Chen, Bonan Kou, Lei Ma, and Tianyi Zhang. An empirical study of code generation errors made by large language models. In 7th Annual Symposium on Machine Programming, 2023

  42. [42]

    Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qingyun Wu, Chi Wang, Ahmed Awadallah, Charles L. A. Clarke, and Julia Kiseleva. Assessing and verifying task utility in llm-powered applications, 2024. URL https://arxiv.org/abs/2405.02178

  43. [43]

    Will Epperson, Gagan Bansal, Victor Dibia, Adam Fourney, Jack Gerrits, Erkang (Eric) Zhu, and Saleema Amershi. Interactive debugging and steering of multi-agent ai systems. In CHI 2025, April 2025. URL https://arxiv.org/abs/2503.02068

  44. [44]

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? On automated failure attribution of llm multi-agent systems, 2025. URL https://arxiv.org/abs/2505.00212

  45. [45]

    Claire B Draucker, Donna S Martsolf, Ratchneewan Ross, and Thomas B Rusk. Theoretical sampling and category development in grounded theory. Qualitative Health Research, 17(8):1137–1148, 2007

  46. [46]

    Shahedul Huq Khandkar. Open coding. University of Calgary, 23(2009):2009, 2009

  47. [47]

    Manus AI. Manus. https://manus.im/, 2025

  48. [48]

    Anthropic. Model context protocol: Introduction. https://modelcontextprotocol.io/introduction, Dec 2024

  49. [49]

    Rao Surapaneni, Miku Jha, Michael Vakoc, and Todd Segal. A2a: A new era of agent interoperability, April 2025. URL https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/. Google Developers Blog

  50. [50]

    Saaket Agashe, Yue Fan, Anthony Reyna, and Xin Eric Wang. Llm-coordination: Evaluating and analyzing multi-agent coordination abilities in large language models, 2025. URL https://arxiv.org/abs/2310.03903

  51. [51]

    Charles Perrow. Normal Accidents: Living with High-Risk Technologies. Princeton University Press, Princeton, NJ, 1984. ISBN 978-0691004129

  52. [52]

    Karlene H. Roberts. New challenges in organizational research: High reliability organizations. Organization & Environment, 3(2):111–125, 1989. doi: 10.1177/108602668900300202

  53. [53]

    Gene I Rochlin. Reliable organizations: Present research and future directions. Journal of Contingencies and Crisis Management, 4(2), 1996. ISSN 0966-0879

  54. [54]

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  55. [55]

    Huy Nhat Phan, Tien N Nguyen, Phong X Nguyen, and Nghi DQ Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale. arXiv preprint arXiv:2409.16299, 2024

  56. [56]

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901, 2024

  57. [57]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024

  58. [58]

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, 2024

  59. [59]

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023

  60. [60]

    Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, and Sadid Hasan. Does prompt formatting have any impact on llm performance? arXiv preprint arXiv:2411.10541, 2024

  61. [61]

    Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023

  62. [62]

    Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023

  63. [63]

    Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  64. [64]

    LangChain. Langgraph, 2024. URL https://www.langchain.com/langgraph

  65. [65]

    Anthropic. Building effective agents, 2024. URL https://www.anthropic.com/research/building-effective-agents

  66. [66]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024

  67. [67]

    Fatemeh Haji, Mazal Bethany, Maryam Tabar, Jason Chiang, Anthony Rios, and Peyman Najafirad. Improving llm reasoning with multi-agent tree-of-thought validator agent. arXiv preprint arXiv:2409.11527, 2024

  68. [68]

    Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, and Yuxiang Wu. Towards reasoning in large language models via multi-agent peer review collaboration. arXiv preprint arXiv:2311.08152, 2023

  69. [69]

    Benedikt Stroebl, Sayash Kapoor, and Arvind Narayanan. Inference scaling flaws: The limits of llm resampling with imperfect verifiers. arXiv preprint arXiv:2411.17501, 2024

  70. [70]

    Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. Are more llm calls all you need? Towards scaling laws of compound inference systems. arXiv preprint arXiv:2403.02419, 2024

  71. [71]

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752, 2024

  72. [72]

    Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023

  73. [73]

    Pavan Kapanipathi, Ibrahim Abdelaziz, Srinivas Ravishankar, Salim Roukos, Alexander Gray, Ramon Astudillo, Maria Chang, Cristina Cornelio, Saswati Dana, Achille Fokoue, et al. Question answering over knowledge bases by leveraging semantic parsing and neuro-symbolic reasoning. arXiv preprint arXiv:2012.01707, 2020

  74. [74]

    Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024

  75. [75]

    Yaru Niu, Rohan R Paleja, and Matthew C Gombolay. Multi-agent graph-attention communication and teaming. In AAMAS, volume 21, page 20th, 2021

  76. [76]

    Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. Advances in Neural Information Processing Systems, 31, 2018

  77. [77]

    Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. arXiv preprint arXiv:1812.09755, 2018

  78. [78]

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022

  79. [79]

    Xudong Guo, Daming Shi, Junjie Yu, and Wenhui Fan. Heterogeneous multi-agent reinforcement learning for zero-shot scalable collaboration. arXiv preprint arXiv:2404.03869, 2024

  80. [80]

    Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system. arXiv preprint arXiv:2410.08115, 2024

Showing first 80 references.