The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
hub
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
27 Pith papers cite this work. Polarity classification is still indexing.
abstract
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.
roles
background 2
polarities
background 2
representative citing papers
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.
OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for bias drift.
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human judgments become automated rules.
TeamFusion uses per-member proxy agents and iterative structured discussions to generate more representative and consensual team deliverables than direct aggregation in open-ended tasks.
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while proving a Safety-Profitability Theorem that rewards honest auditors.
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.
BLUEmed combines hybrid RAG with structured multi-agent debate and a safety filter to detect terminology substitution errors in clinical notes, reaching 69.13% accuracy under few-shot prompting and outperforming single-agent and debate-only baselines.
Deterministic multi-agent intent routing can reduce hallucinations in generative engines to near zero by limiting LLMs to intent routers and handing off tasks to specialized agents.
citing papers explorer
- Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for bias drift.
- TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems
TeamFusion uses per-member proxy agents and iterative structured discussions to generate more representative and consensual team deliverables than direct aggregation in open-ended tasks.
- Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
- Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
- Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
- Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.