Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971
Mixed citation behavior. Most common role is background (60%).
citing papers explorer
- Why Do Multi-Agent LLM Systems Fail?
  The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
  A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
- When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling
  DESBench reveals structural trade-offs among centralized, hierarchical, heterarchical, and holonic coordination in dynamic industrial scheduling that outcome metrics alone miss.
- Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest
  C2C is a new testbed where LM agents negotiate differently from humans and targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.
- CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
  CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
  Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
- Interactive Evaluation Requires a Design Science
  Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
- Agentic Reasoning for Large Language Models
  The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.