pith. machine review for the scientific record.


ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

26 Pith papers cite this work. Polarity classification is still indexing.

abstract

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.
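The abstract's core idea — a referee team of LLM agents that discuss an answer over several rounds and converge on a score — can be sketched in a few lines. This is a minimal illustration, not ChatEval's actual prompts or API: the agent functions here are toy stand-ins for LLM calls, and all names and scoring heuristics are invented for the example.

```python
import statistics

def debate_evaluate(agents, question, answer, rounds=2):
    """ChatEval-style referee debate (sketch): each agent reads the shared
    transcript, adds a comment, and scores the answer; the group score is
    the mean of each agent's final-round score."""
    transcript = []   # list of (agent_name, comment), visible to all agents
    scores = {}       # agent_name -> that agent's latest score
    for _ in range(rounds):
        for name, agent in agents.items():
            comment, score = agent(question, answer, transcript)
            transcript.append((name, comment))
            scores[name] = score
    return statistics.mean(scores.values()), transcript

# Two toy "annotator" agents standing in for LLM calls.
def generous_referee(question, answer, transcript):
    score = 8 if len(answer.split()) >= 4 else 5
    return f"covers the question in {len(answer.split())} words", score

def strict_referee(question, answer, transcript):
    score = 6 if "because" in answer else 4   # wants explicit reasoning
    if any("covers the question" in c for _, c in transcript):
        score += 1                            # concede a point to the peer
    return "looking for explicit justification", score

agents = {"generous": generous_referee, "strict": strict_referee}
score, transcript = debate_evaluate(
    agents, "Why is the sky blue?",
    "Because shorter wavelengths scatter more in the atmosphere")
```

In the real system each agent would be an LLM with a distinct persona prompt, and the debate transcript is what lets agents revise their judgments in response to one another, mimicking multiple human annotators collaborating.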

hub tools

citation-role summary: background (2)

citation-polarity summary: background (2)

representative citing papers

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

AI-Gram: When Visual Agents Interact in a Social Network

cs.AI · 2026-04-23 · unverdicted · novelty 7.0

Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

The GAIA benchmark shows that humans reach 92% accuracy on conceptually simple real-world questions while current AI systems reach only 15%, and proposes closing this gap as a key milestone toward general AI.

Pact: A Choreographic Language for Agentic Ecosystems

cs.PL · 2026-05-04 · unverdicted · novelty 6.0

Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.

PARM: Pipeline-Adapted Reward Model

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

TRUST: A Framework for Decentralized AI Service v.0.1

cs.AI · 2026-04-29 · unverdicted · novelty 5.0

TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while proving a Safety-Profitability Theorem that rewards honest auditors.

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

cs.CL · 2026-04-12 · unverdicted · novelty 4.0

BLUEmed combines hybrid RAG with structured multi-agent debate and a safety filter to detect terminology substitution errors in clinical notes, reaching 69.13% accuracy under few-shot prompting and outperforming single-agent and debate-only baselines.
