BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
4 papers cite this work.
Citing papers
- Why Do Multi-Agent LLM Systems Fail?
  The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
- When Does Hierarchy Help? Benchmarking Agent Coordination in Event-Driven Industrial Scheduling
  DESBench reveals structural trade-offs among centralized, hierarchical, heterarchical, and holonic coordination in dynamic industrial scheduling that outcome metrics alone miss.
- Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest
  C2C is a new testbed in which LM agents negotiate differently from humans; targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.
- CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
  CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.