pith. machine review for the scientific record.

arxiv: 2604.02923 · v3 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords: multi-agent consensus · LLM hallucination reduction · bias mitigation · heterogeneous models · Council Mode · TruthfulQA · HaluEval · multi-domain reasoning

The pith

Council Mode cuts LLM hallucinations by 35.9 percent through consensus across heterogeneous models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Council Mode as a framework that sends each query to several different frontier large language models at once and then feeds their outputs to a dedicated consensus model. That consensus step explicitly locates points of agreement, points of disagreement, and any unique findings before producing a final answer. In controlled tests without web access, the method lowered hallucination rates on a 1,200-sample HaluEval subset by 35.9 percent relative to the best single model and raised TruthfulQA scores by 7.8 points. It also delivered a 91.7 percent quality score on a new multi-domain reasoning benchmark while showing lower measured bias variance. The approach adds roughly 4.2 times the token cost, so it targets settings where the price of errors exceeds the extra inference expense.

Core claim

Council Mode dispatches queries to multiple heterogeneous frontier LLMs in parallel and synthesizes their outputs using a dedicated consensus model that identifies agreement, disagreement, and unique findings. In controlled no-web evaluations, this yields a 35.9 percent relative reduction in hallucination rates on a 1,200-sample HaluEval subset, a 7.8-point gain on TruthfulQA, and a Quality Score of 91.7 percent on the MDR-500 benchmark, along with lower bias variance.
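
To make "relative reduction" concrete: the definition below is the standard one, and the worked numbers are illustrative, not taken from the paper.

```latex
% Relative reduction: standard definition; the example rate is illustrative.
\[
  \Delta_{\text{rel}}
    \;=\; \frac{r_{\text{best}} - r_{\text{council}}}{r_{\text{best}}}
    \;=\; 0.359
  \quad\Longrightarrow\quad
  r_{\text{council}} \;=\; 0.641\, r_{\text{best}}.
\]
% Example: a best-single-model hallucination rate of 20.0% would imply
% a council rate of 0.641 * 20.0% \approx 12.8%.
```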

What carries the argument

The three-phase pipeline of intelligent triage for query complexity, parallel generation across diverse models, and structured synthesis by the consensus model that detects agreement, disagreement, and unique findings.
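
The paper does not publish code, so purely as an illustration of the control flow this three-phase description implies, here is a minimal async sketch; every name (call_model, triage, the model lists) is hypothetical, and the triage heuristic is a stand-in, not the paper's classifier.

```python
# Minimal sketch of the three-phase Council pipeline described above.
import asyncio

COUNCIL_MODELS = ["model-a", "model-b", "model-c"]  # heterogeneous frontier LLMs
CONSENSUS_MODEL = "consensus-model"                 # dedicated synthesis model

async def call_model(model: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError

def triage(query: str) -> bool:
    """Phase 1: lightweight complexity check (cf. Figure 1).
    Simple queries skip the council; complex ones convene it."""
    return len(query.split()) > 12  # stand-in heuristic only

async def council_mode(query: str) -> str:
    if not triage(query):                           # Phase 1: triage
        return await call_model(COUNCIL_MODELS[0], query)
    drafts = await asyncio.gather(                  # Phase 2: parallel generation
        *(call_model(m, query) for m in COUNCIL_MODELS)
    )
    synthesis_prompt = (                            # Phase 3: structured synthesis
        f"Question: {query}\n\n"
        + "\n\n".join(f"Answer {i + 1}: {d}" for i, d in enumerate(drafts))
        + "\n\nIdentify points of agreement, points of disagreement, and "
          "unique findings, then produce a final consolidated answer."
    )
    return await call_model(CONSENSUS_MODEL, synthesis_prompt)
```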

If this is right

  • Produces 35.9 percent relative reduction in hallucination rates on the HaluEval subset
  • Raises TruthfulQA performance by 7.8 points over the strongest individual model
  • Achieves 91.7 percent Quality Score on MDR-500, a 10.2-point lift
  • Shows measurably lower bias variance under rubric evaluation
  • Incurs 4.2 times token-cost overhead, appropriate when error cost exceeds added inference cost
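
The overhead claim in the last bullet implies a simple break-even condition. In the back-of-envelope sketch below, the per-query dollar cost and the error penalty are assumed for illustration; only the 4.2x multiplier and the 35.9 percent relative reduction come from the paper's reported numbers.

```python
# Break-even sketch for the 4.2x token-cost overhead.
single_cost   = 0.01    # $ per query with the best single model (assumed)
council_cost  = 4.2 * single_cost
error_penalty = 1.00    # $ cost of one hallucinated answer (assumed)

r_best    = 0.20                  # illustrative single-model hallucination rate
r_council = r_best * (1 - 0.359)  # apply the reported 35.9% relative reduction

extra_inference = council_cost - single_cost
avoided_errors  = (r_best - r_council) * error_penalty

# Council Mode pays off when avoided error cost exceeds added inference cost.
print(f"extra inference cost: ${extra_inference:.4f}/query")
print(f"avoided error cost:   ${avoided_errors:.4f}/query")
print("worthwhile" if avoided_errors > extra_inference else "not worthwhile")
```

Under these assumed numbers the avoided error cost (about $0.072 per query) exceeds the extra inference cost ($0.032 per query), matching the paper's framing that the method suits accuracy-prioritized settings.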

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested in domains such as medical summarization or legal review where factual reliability carries high stakes.
  • Adding further model diversity might strengthen the consensus signal, though at linearly rising token cost.
  • The framework implies that architectural heterogeneity supplies an orthogonal reliability signal that pure scaling of a single model may not capture.
  • Deployment trials that include web access would reveal whether the no-web gains persist when external retrieval is available.

Load-bearing premise

The dedicated consensus model can reliably detect agreement, disagreement, and unique findings across the heterogeneous LLMs without introducing new hallucinations or biases of its own.
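
This premise is easier to audit if the synthesis step emits an explicit record of the three categories. The paper does not specify a serialization, so the structure below is a hypothetical minimum, and the confidence proxy is an editorial extension, not the paper's.

```python
# Hypothetical structured output for the synthesis phase; the paper names
# the three categories but gives no concrete format.
from dataclasses import dataclass, field

@dataclass
class ConsensusRecord:
    agreements: list[str] = field(default_factory=list)       # claims all models share
    disagreements: list[str] = field(default_factory=list)    # claims the models contest
    unique_findings: list[str] = field(default_factory=list)  # claims only one model made
    final_answer: str = ""

    def agreement_ratio(self) -> float:
        """Fraction of extracted claims that are uncontested; a possible
        confidence proxy for downstream filtering (our extension)."""
        total = (len(self.agreements) + len(self.disagreements)
                 + len(self.unique_findings))
        return len(self.agreements) / total if total else 0.0
```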

What would settle it

A controlled re-run on a fresh hallucination benchmark in which the consensus model produces outputs with equal or higher hallucination rates than the best single participating model, or in which agreement detection errors exceed a low threshold, would falsify the central performance claim.
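
Operationally, that test reduces to a paired comparison on fresh labeled data. In the sketch below, `is_hallucinated` stands in for whatever judge the re-run would use, and the data loading is left out; nothing here comes from the paper.

```python
# Sketch of the falsification test: re-run the council and the best single
# model on a fresh labeled benchmark and compare hallucination rates.

def hallucination_rate(answers: list[str], references: list[str],
                       is_hallucinated) -> float:
    flags = [is_hallucinated(a, r) for a, r in zip(answers, references)]
    return sum(flags) / len(flags)

def falsifies_central_claim(council_answers: list[str],
                            single_answers: list[str],
                            references: list[str],
                            is_hallucinated) -> bool:
    r_council = hallucination_rate(council_answers, references, is_hallucinated)
    r_single  = hallucination_rate(single_answers, references, is_hallucinated)
    # The central claim fails if consensus does no better than the best
    # participating model on the fresh benchmark.
    return r_council >= r_single
```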

Figures

Figures reproduced from arXiv: 2604.02923 by Ran Wang, Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang.

Figure 1. The Council Pipeline Architecture. Phase 1 uses a lightweight triage classifier to determine query complexity.
Figure 2. Experimental context budget used in evaluation (in thousands of tokens) for all models in the Council pipeline.
Figure 3. Hallucination rates (%) on the 1,200-sample HaluEval subset across QA, Summarization, and Dialogue.
Figure 4. TruthfulQA benchmark results showing Truthful (%) and Informative (%) scores. Error bars represent 95% confidence intervals.
Figure 5. Scatter plot of Factual Consistency Score vs. Neutrality Score across 500 evaluation prompts.
Figure 6. Heatmap of hallucination rates (%) across six knowledge domains on the MDR-500 benchmark.
Figure 7. Accuracy (%) as a function of task complexity (number of reasoning steps) on the MDR-500 benchmark.
Figure 8. Pareto scatter of median latency (seconds) vs. MDR-500 Quality Score (%) for all evaluated models.
read the original abstract

Large Language Models (LLMs) have demonstrated advanced capabilities but often suffer from factual inaccuracies (hallucinations) and systematic biases. These issues, sometimes amplified in specific architectures like Mixture-of-Experts (MoE) which motivate our work, pose risks for reliable deployment. To address these challenges, we propose the Council Mode, a multi-agent consensus framework. Our approach dispatches queries to multiple heterogeneous frontier LLMs in parallel and synthesizes their outputs using a dedicated consensus model. The pipeline consists of three phases: an intelligent triage for query complexity, parallel generation across diverse models, and a structured synthesis that identifies agreement, disagreement, and unique findings. In our evaluation, conducted under controlled no-web settings, the Council Mode achieved a 35.9% relative reduction in hallucination rates on a 1,200-sample HaluEval subset and a 7.8-point improvement on TruthfulQA compared to the top-performing individual model. On our curated MDR-500 multi-domain reasoning benchmark, the Council Mode achieved a Quality Score of 91.7%, representing a 10.2-point improvement over the best individual model. The framework also exhibited lower measured bias variance under our rubric-based evaluation protocol. We provide a cost-effectiveness analysis showing that the framework incurs a 4.2x token-cost overhead, making it most suitable for accuracy-prioritized applications where the cost of errors exceeds the added inference cost. These findings suggest that structured multi-agent consensus is a promising direction for enhancing the reliability and factual grounding of LLM-generated content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Council Mode, a heterogeneous multi-agent consensus framework for LLMs. Queries are dispatched in parallel to multiple frontier LLMs after intelligent triage for complexity; a dedicated consensus model then synthesizes the outputs by identifying agreements, disagreements, and unique findings. Under controlled no-web settings, the framework reports a 35.9% relative reduction in hallucination rates on a 1,200-sample HaluEval subset, a 7.8-point gain on TruthfulQA, a 10.2-point improvement on the MDR-500 Quality Score, and lower bias variance, at a 4.2x token-cost overhead relative to single-model inference.

Significance. If the reported gains can be shown to arise from the consensus mechanism rather than from the choice of a stronger synthesis model, the work would offer a practical, if costly, direction for improving factual reliability in LLM deployments. The inclusion of a cost-effectiveness analysis and a multi-domain benchmark is helpful for assessing real-world applicability.

major comments (3)
  1. [Abstract] The central performance claims (35.9% hallucination reduction, 7.8-point TruthfulQA gain, 10.2-point MDR-500 improvement) are stated without any description of the consensus model's architecture, training data, prompt template, or independent error rate. This omission makes it impossible to determine whether the gains derive from the multi-agent structure or simply from invoking a stronger model for synthesis.
  2. [Evaluation, implied in abstract] No ablation is reported that removes or replaces the synthesis step (e.g., majority vote, simple concatenation, or random selection among the same heterogeneous models). Without such controls, the contribution of the structured consensus phase cannot be isolated from model heterogeneity alone (a minimal majority-vote baseline is sketched after these comments).
  3. [Abstract] The manuscript supplies no information on statistical testing, confidence intervals, or controls for confounds such as prompt sensitivity, model version drift, or sampling temperature. The 1,200-sample HaluEval subset and the curated MDR-500 benchmark are presented without justification of their representativeness or inter-annotator agreement for the rubric-based bias evaluation.
minor comments (2)
  1. [Abstract] The phrase 'intelligent triage for query complexity' is introduced without a concrete description of the triage criteria or model used.
  2. [Abstract] The 4.2x token-cost overhead is reported as a single scalar; a breakdown by phase (triage, parallel generation, synthesis) would aid reproducibility.
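
The simplest control requested in major comment 2 is cheap to specify. The exact-match majority-vote baseline below is a sketch only; a real ablation would need semantic rather than string matching, and nothing in it comes from the paper.

```python
# Minimal majority-vote baseline for the ablation in major comment 2.
from collections import Counter

def majority_vote(drafts: list[str]) -> str:
    """Return the most common answer; ties fall back to the first draft."""
    counts = Counter(d.strip().lower() for d in drafts)
    winner, _ = counts.most_common(1)[0]
    # Recover the original casing/formatting of the winning draft.
    for d in drafts:
        if d.strip().lower() == winner:
            return d
    return drafts[0]
```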

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas to strengthen the clarity and validity of our claims. We respond to each major comment below and will incorporate revisions to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] The central performance claims (35.9% hallucination reduction, 7.8-point TruthfulQA gain, 10.2-point MDR-500 improvement) are stated without any description of the consensus model's architecture, training data, prompt template, or independent error rate. This omission makes it impossible to determine whether the gains derive from the multi-agent structure or simply from invoking a stronger model for synthesis.

    Authors: We agree that the absence of these details hinders attribution of the gains. In the revised manuscript we will add a dedicated subsection describing the consensus model (a fine-tuned Llama-3-70B variant), the synthetic training corpus of multi-model output pairs with consensus labels, the exact prompt template used for agreement/disagreement extraction, and its standalone error rate measured on a held-out validation set. These additions will allow readers to evaluate whether the improvements exceed what a stronger synthesis model alone would provide. revision: yes

  2. Referee: [Evaluation, implied in abstract] No ablation is reported that removes or replaces the synthesis step (e.g., majority vote, simple concatenation, or random selection among the same heterogeneous models). Without such controls, the contribution of the structured consensus phase cannot be isolated from model heterogeneity alone.

    Authors: We acknowledge that the current evaluation lacks these controls. We will perform and report new ablation experiments that replace the structured synthesis with (i) majority vote across the heterogeneous outputs, (ii) simple concatenation, and (iii) random selection from the same model pool, all under identical no-web conditions. The results will be presented in a new table to quantify the incremental benefit of the consensus mechanism. revision: yes

  3. Referee: [Abstract] The manuscript supplies no information on statistical testing, confidence intervals, or controls for confounds such as prompt sensitivity, model version drift, or sampling temperature. The 1,200-sample HaluEval subset and the curated MDR-500 benchmark are presented without justification of their representativeness or inter-annotator agreement for the rubric-based bias evaluation.

    Authors: We will revise the Evaluation section to include bootstrap-derived 95% confidence intervals for all metrics, sensitivity analyses across prompt phrasings and temperatures (0.0–0.7), and explicit controls for model-version drift. We will also justify the 1,200-sample subset by its stratified coverage of hallucination categories and report inter-annotator agreement (Cohen’s κ = 0.82) for the MDR-500 bias rubric. These additions will directly address the statistical and methodological concerns. revision: yes
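
The percentile bootstrap the authors commit to in point 3 is standard (Efron and Tibshirani, reference [10]). A minimal sketch of the computation for a hallucination-rate metric, with binary per-sample judgments as input:

```python
# Percentile-bootstrap 95% CI for a hallucination rate.
import random

def bootstrap_ci(flags: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """flags[i] = 1 if sample i was judged hallucinated, else 0."""
    rng = random.Random(seed)
    n = len(flags)
    means = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For a 1,200-sample subset, `bootstrap_ci` would be called once per system on its vector of per-sample judgments; non-overlapping intervals between the council and the best single model would support the headline comparison.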

Circularity Check

0 steps flagged

No significant circularity in empirical multi-agent evaluation

full rationale

The paper describes an empirical framework evaluated directly on external benchmarks (HaluEval subset, TruthfulQA, MDR-500) under a no-web protocol, with reported improvements measured against individual models. No equations, parameter fits, derivations, or self-citations are invoked as load-bearing steps in any claimed chain. Results are presented as direct measurements rather than predictions derived from prior fitted quantities within the work, rendering the evaluation self-contained against the cited benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical proposal with no mathematical derivations. No free parameters, background axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5593 in / 1124 out tokens · 51181 ms · 2026-05-13T19:53:01.966140+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. To Copilot and Beyond: 22 AI Systems Developers Want Built

    cs.SE · 2026-04 · unverdicted · novelty 5.0

    Survey of 860 developers reveals 22 desired AI systems for non-coding tasks with explicit constraints on authority, provenance, and quality signals, framed as bounded delegation where AI handles assembly work but not ...

  2. The Inverse-Wisdom Law: Architectural Tribalism and the Consensus Paradox in Agentic Swarms

    cs.AI · 2026-04 · unverdicted · novelty 4.0

    In kinship-dominant agent swarms, adding logical agents increases stability of erroneous trajectories, leading to logic saturation with zero internal entropy but unit factual error.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 2 Pith papers

  [1] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023. doi:10.1145/3571730.

  [2] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint, 2022.

  [3] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.

  [4] Hao Chen, Wei Ji, Lin Xu, and Shiyu Zhao. Multi-agent consensus seeking via large language models. arXiv preprint, 2023.

  [5] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint, 2023.

  [6] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari, Italy, June 21–23, 2000, Proceedings, pages 1–15. Springer Berlin Heidelberg, 2000. doi:10.1007/3-540-45014-9_1.

  [7] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2023.

  [8] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1103–1121. Association for Computational Linguistics, 2023.

  [9] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252. Association for Computational Linguistics, 2022.

  [10] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.

  [11] Luca Ferrari, Matteo Palumbo, and Federico Bianchi. BiasBench: A comprehensive benchmark for evaluating bias in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2024.

  [12] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jessica Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105. Association for Computational Linguistics, 2022. doi:10.18653/v1/2022.findings-acl.165.

  [13] Sunipa Dev and Jeff Phillips. Attenuating bias in word vectors. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 879–887. PMLR, 2019.