Minority Sentinel: When to Overturn Majority Voting in Multi-Agent LLM Debates

Chuan He; Dong Wen; Guanfeng Liu; Jiate Liu; Mingchen Ju; Shaobo Qiao; Zebin Chen; Zhengyi Yang

arxiv: 2606.29270 · v1 · pith:B5TATFYXnew · submitted 2026-06-28 · 💻 cs.MA

Pith reviewed 2026-06-30 02:21 UTC · model grok-4.3

keywords: multi-agent debate, majority voting, minority truth, LightGBM classifier, flip precision, LLM reasoning, debate logs, Condorcet jury theorem

The pith

A lightweight classifier can overturn majority votes in LLM debates when the minority answer is correct by reading behavioral signals in the logs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent debate systems use majority voting to boost LLM reasoning under the assumption that agent errors are independent, yet shared pretraining data makes errors correlated so the majority often buries the correct minority view. Experiments with three heterogeneous LLMs across six benchmarks show that the minority holds the right answer in roughly one quarter of divergent cases, leaving a 10-percentage-point recovery margin. Minority Sentinel builds a multi-dimensional fingerprint from the debate logs and trains a LightGBM model to decide when to flip the majority decision. The model reaches 81.2 percent flip precision and positive net gain on every dataset and every random seed, while an LLM-as-Judge baseline produces negative net gain. The result demonstrates that the logs already contain enough behavioral information for a non-LLM classifier to intervene safely.

Core claim

The paper claims that debate logs from three heterogeneous LLM agents carry enough behavioral signal for a LightGBM classifier, trained on multi-dimensional debate fingerprints, to identify cases where the minority answer is correct. It reports a stable 81.2 percent flip precision and positive net gain across all six benchmarks and all 20 random-seed trials, and it avoids the accuracy degradation seen with the LLM-as-Judge baseline.

What carries the argument

The multi-dimensional debate fingerprint extracted from debate logs, which LightGBM uses to predict when overturning the majority vote will recover a correct minority answer.

If this is right

Selective overturns based on the classifier improve overall system accuracy without changing the base LLMs or adding more agents.
Behavioral signals in the logs support safer flips than asking another LLM to judge the debate.
The positive net gain holds across all tested datasets and random seeds, indicating stability of the signals.
Roughly one in four divergent cases offers a recoverable minority truth that majority voting otherwise suppresses.
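
As a rough consistency check (an inference, not a figure reported in the source), suppose the 10-point margin is measured over all samples and the minority is right in about a quarter of divergent cases. Under that assumption the two numbers together imply that roughly 40 percent of samples end in a divergent vote; the symbol d is introduced here only for illustration.

```latex
\[
\underbrace{P(\text{divergent})}_{d}\;\times\;
\underbrace{P(\text{minority correct}\mid\text{divergent})}_{\approx\,0.25}
\;\approx\; 0.10
\quad\Longrightarrow\quad
d \;\approx\; \frac{0.10}{0.25} \;=\; 0.40
\]
```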

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same fingerprint approach could be tested on debates involving more than three agents to check whether disagreement patterns remain informative.
Explicit logging of agent disagreement trajectories might eventually replace simple majority voting in multi-agent systems.
Extending the method to open-ended or long-form tasks would test whether the behavioral signals generalize beyond the six benchmarks used here.

Load-bearing premise

The behavioral signals recorded in the debate logs are consistent enough that a LightGBM model trained on the six benchmarks will continue to produce high flip precision and positive net gain on new data.

What would settle it

Applying the trained Minority Sentinel to a fresh set of benchmarks or different LLMs and observing flip precision below 60 percent or negative net gain would show that the signals are not sufficient or generalizable.

Figures

Figures reproduced from arXiv: 2606.29270 by Chuan He, Dong Wen, Guanfeng Liu, Jiate Liu, Mingchen Ju, Shaobo Qiao, Zebin Chen, Zhengyi Yang.

**Figure 1.** System overview of the Minority Sentinel framework. The Diagnosis phase collects divergent samples through …

**Figure 2.** Top-10 feature importance by split count. The top …

**Figure 3.** Threshold sweep analysis: (a) Net Gain as a function of the decision threshold …

**Figure 4.** 20-seed stability. All seeds produce positive Net …

Original abstract

Multi-Agent Debate (MAD) with Majority Voting is a dominant paradigm for improving LLM reasoning, yet its effectiveness rests on the Condorcet Jury Theorem's assumption of independent errors. Because contemporary LLMs share similar pretraining corpora, their errors are strongly correlated, causing the majority to systematically suppress correct minority opinions, a phenomenon we term Minority Truth. Through debates among three heterogeneous LLM agents on six benchmarks, we find that roughly one in four divergent cases has the minority holding the correct answer, yielding a 10-percentage-point theoretical recovery margin. We propose Minority Sentinel, a lightweight meta-classifier that extracts a multi-dimensional debate fingerprint from debate logs and trains a LightGBM model to decide when to overturn majority voting. Minority Sentinel achieves a stable Flip Precision of 81.2% with positive Net Gain across all six datasets and all 20 random seed trials, demonstrating that debate logs contain sufficient behavioral signals for a non-LLM classifier to reliably recover suppressed minorities without degrading system accuracy. The LLM-as-Judge baseline yields negative Net Gain despite higher recall, confirming that flip safety, not recovery volume, determines intervention value.
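
To make the independence assumption concrete, the sketch below compares three-agent majority accuracy under independent errors with a toy shared-error model. It is an illustration, not the paper's analysis: the per-agent accuracy `p`, the mixing weight `rho`, and both function names are assumptions introduced here.

```python
def majority_accuracy_independent(p: float) -> float:
    """Probability that at least 2 of 3 independent agents are correct."""
    return p**3 + 3 * p**2 * (1 - p)


def majority_accuracy_shared(p: float, rho: float) -> float:
    """Toy correlated-error model (an assumption, not the paper's model).

    With probability rho all three agents share a single draw that is
    correct with probability p; otherwise they answer independently.
    """
    return rho * p + (1 - rho) * majority_accuracy_independent(p)


if __name__ == "__main__":
    p = 0.7  # illustrative per-agent accuracy
    for rho in (0.0, 0.5, 1.0):
        acc = majority_accuracy_shared(p, rho)
        print(f"rho={rho:.1f}  majority accuracy={acc:.3f}")
```

In this toy model, the benefit of voting shrinks as `rho` grows: at `rho = 1` the majority is no more accurate than a single agent, which is the failure mode the abstract attributes to shared pretraining corpora.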

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A LightGBM meta-classifier on debate logs can flip majority votes with 81% precision and net gain, but same-benchmark training leaves generalization unproven.

A LightGBM model trained on debate fingerprints can flip majority votes to the minority opinion in multi-agent LLM debates at 81% precision, delivering positive net accuracy gains across the six benchmarks and all seeds tested.

The paper introduces this meta-classifier as an alternative to pure majority voting or an LLM judge. The authors document that the minority holds the correct answer in roughly one in four disagreement cases, then show that their approach recovers many of those cases safely. The results are consistent, and the method outperforms the LLM-as-Judge baseline on net gain because it prioritizes precision over recall.

The execution looks reasonable for the reported experiments. Using heterogeneous agents and multiple random seeds adds some robustness to the findings.

The soft spot is the evaluation setup. Training and testing on the same six benchmarks means the classifier might be capturing dataset-specific correlations in how the agents respond to those particular problems rather than general signals in any debate. Without cross-benchmark validation or tests on unseen tasks, the claim that debate logs contain reliable, transferable behavioral signals for recovering minorities is not fully backed up. That is the main limitation.

This work is for researchers focused on multi-agent debate systems and ways to make them more accurate. A reader interested in practical fixes for correlated errors in LLMs would get value from the method and the numbers. It deserves a serious referee because it has clear empirical support and addresses a real issue in the MAD paradigm.

I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper identifies that majority voting in three-agent LLM debates systematically suppresses correct minority answers (Minority Truth) in roughly 25% of divergent cases across six benchmarks, creating a 10pp recovery margin. It introduces Minority Sentinel, a LightGBM meta-classifier trained on multi-dimensional debate fingerprints extracted from logs, which decides when to flip the majority vote. The method reports 81.2% flip precision and positive net gain on all datasets and all 20 random seeds, while an LLM-as-judge baseline yields negative net gain despite higher recall.

Significance. If the generalization claim holds, the work supplies concrete evidence that lightweight, non-LLM classifiers can recover suppressed correct answers from behavioral signals in debate logs, improving MAD accuracy without extra LLM inference cost. The consistent positive net gain across seeds and the direct comparison to LLM-as-judge baselines are notable strengths; the approach is falsifiable via the reported flip-precision and net-gain metrics.

major comments (2)

[Experimental Evaluation / Results] Experimental section (implicit in abstract and results): training and test splits are performed within the same six benchmarks without reported leave-one-benchmark-out or external-task validation. This setup risks the LightGBM model capturing benchmark-specific response patterns or task formats rather than domain-agnostic debate signals, which directly weakens the claim that the fingerprint enables reliable recovery 'in general MAD settings.'
[Results / Net Gain] § on Net Gain calculation: the definition of net gain and the precise weighting of false-positive flips versus true-positive recoveries are not fully specified, making it impossible to verify that the reported positive net gain is robust to alternative cost assumptions or to confirm it is not an artifact of post-hoc threshold selection on the same data.
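
The leave-one-benchmark-out protocol requested in the first major comment can be sketched as follows. This is a minimal, library-free illustration; the function name and the placeholder benchmark labels are hypothetical, and the paper's actual data layout is not known here.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_benchmark_out(
    benchmarks: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_name, train_indices, test_indices) for each benchmark.

    benchmarks[i] is the benchmark label of sample i.
    """
    by_name: Dict[str, List[int]] = defaultdict(list)
    for idx, name in enumerate(benchmarks):
        by_name[name].append(idx)

    for held_out in sorted(by_name):
        test_idx = by_name[held_out]
        # Train on every sample that does not belong to the held-out benchmark.
        train_idx = sorted(
            i for name, idxs in by_name.items() if name != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    labels = ["A", "A", "B", "C", "B", "C"]  # placeholder benchmark names
    for name, train, test in leave_one_benchmark_out(labels):
        print(f"held out {name}: train={train} test={test}")
```

Because no sample from the held-out benchmark reaches the training fold, a classifier that only memorizes benchmark-specific response patterns should lose precision under this split, which is what the comment asks the authors to rule out.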

minor comments (2)

[Method] The multi-dimensional fingerprint features are described at a high level; an explicit list or table of the features used (e.g., token entropy, agreement ratios, response length) would improve reproducibility.
[Figures] Figure captions and axis labels for the flip-precision and net-gain plots should include the exact number of trials (20 seeds) and confidence intervals to allow readers to assess stability.
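
To illustrate what an explicit feature list might look like, here is a hedged sketch of a fingerprint extractor. Everything in it is an assumption: the log schema (`agent`, `answer`, `text`), the function names, and the four features are hypothetical stand-ins, since the paper's real feature set is not listed in this review. The entropy here is over the agents' answers, not over tokens.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical log schema: one dict per agent per round.
Round = List[Dict[str, str]]  # keys: "agent", "answer", "text"


def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (in bits) of the answer distribution in one round."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def debate_fingerprint(rounds: List[Round]) -> Dict[str, float]:
    """Build an illustrative feature vector from a multi-round debate log."""
    final_answers = [turn["answer"] for turn in rounds[-1]]
    _, majority_count = Counter(final_answers).most_common(1)[0]

    # Count answer changes between consecutive rounds, summed over agents.
    switches = 0
    for prev, curr in zip(rounds, rounds[1:]):
        prev_by_agent = {t["agent"]: t["answer"] for t in prev}
        switches += sum(
            1 for t in curr if prev_by_agent.get(t["agent"]) != t["answer"]
        )

    lengths = [len(t["text"].split()) for r in rounds for t in r]
    return {
        "final_agreement_ratio": majority_count / len(final_answers),
        "final_answer_entropy": answer_entropy(final_answers),
        "total_answer_switches": float(switches),
        "mean_response_length": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    log = [
        [{"agent": "a1", "answer": "A", "text": "I pick A"},
         {"agent": "a2", "answer": "B", "text": "B is right"},
         {"agent": "a3", "answer": "B", "text": "B"}],
        [{"agent": "a1", "answer": "A", "text": "Still A"},
         {"agent": "a2", "answer": "A", "text": "Switching to A"},
         {"agent": "a3", "answer": "B", "text": "B"}],
    ]
    print(debate_fingerprint(log))
```

A flat dictionary of named floats maps directly onto the tabular input a gradient-boosted tree model expects, which is one reason a table of features in the paper would make the method easy to reproduce.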

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: Experimental section (implicit in abstract and results): training and test splits are performed within the same six benchmarks without reported leave-one-benchmark-out or external-task validation. This setup risks the LightGBM model capturing benchmark-specific response patterns or task formats rather than domain-agnostic debate signals, which directly weakens the claim that the fingerprint enables reliable recovery 'in general MAD settings.'

Authors: We agree that the current within-benchmark splits limit the strength of the generalization claim. In the revised version we will add leave-one-benchmark-out experiments (training on five benchmarks and evaluating on the held-out benchmark) together with a summary of performance variance across folds. These additional results will be reported in a new subsection of the experimental evaluation. revision: yes

Referee: § on Net Gain calculation: the definition of net gain and the precise weighting of false-positive flips versus true-positive recoveries are not fully specified, making it impossible to verify that the reported positive net gain is robust to alternative cost assumptions or to confirm it is not an artifact of post-hoc threshold selection on the same data.

Authors: We accept that the net-gain definition and weighting require explicit formalization. The revised manuscript will include the exact formula (net gain = TP recoveries × benefit − FP flips × cost) with the default 1:1 cost ratio, a sensitivity table for alternative ratios (1:2 and 2:1), and confirmation that the positive net gain remains stable under these weightings. We will also state that the threshold was selected via cross-validation on the training folds only. revision: yes
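
The weighted net-gain formula in the rebuttal can be sketched in a few lines. The names `DivergentCase` and `flip_metrics` are hypothetical, the toy data is invented for illustration, and two definitions are assumed rather than taken from the paper: a true-positive flip is one where the minority was right and the majority wrong, and flip precision is true-positive flips divided by all flips.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DivergentCase:
    majority_correct: bool
    minority_correct: bool
    flipped: bool  # True if the classifier overturned the majority vote


def flip_metrics(
    cases: Iterable[DivergentCase], benefit: float = 1.0, cost: float = 1.0
) -> Tuple[float, float]:
    """Return (flip_precision, net_gain) under the assumed definitions."""
    flips = [c for c in cases if c.flipped]
    # True positive: the flip recovers a correct minority answer.
    tp = sum(1 for c in flips if c.minority_correct and not c.majority_correct)
    # False positive: the flip destroys a correct majority answer.
    fp = sum(1 for c in flips if c.majority_correct)
    precision = tp / len(flips) if flips else 0.0
    net_gain = tp * benefit - fp * cost
    return precision, net_gain


if __name__ == "__main__":
    # Toy data: 4 good flips, 1 bad flip, 3 cases left untouched.
    toy = (
        [DivergentCase(False, True, True)] * 4
        + [DivergentCase(True, False, True)]
        + [DivergentCase(True, False, False)] * 3
    )
    for benefit, cost in [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]:
        precision, gain = flip_metrics(toy, benefit, cost)
        print(f"benefit={benefit} cost={cost} precision={precision:.2f} net_gain={gain}")
```

Sweeping the benefit and cost weights this way is a cheap check of whether a positive net gain depends on the 1:1 default, which is the sensitivity analysis the authors promise.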

Circularity Check

0 steps flagged

No significant circularity; empirical metrics are measured outcomes

The paper's core claim rests on experimental results: a LightGBM meta-classifier is trained on debate fingerprints extracted from logs generated by three LLMs across six benchmarks, then evaluated for Flip Precision (81.2%) and Net Gain. These quantities are reported as observed performance across datasets and random seeds, not quantities defined in terms of themselves or forced by construction from fitted parameters. No equations, self-citations, ansatzes, or uniqueness theorems are invoked to derive the result; the demonstration that 'debate logs contain sufficient behavioral signals' is an empirical finding rather than a self-referential reduction. Potential issues of train/eval overlap on the same benchmarks affect generalization validity but do not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The paper's contribution rests on the empirical validation of the meta-classifier and the domain assumption about LLM error correlation; no new physical entities or unproven math axioms are introduced.

free parameters (1)

LightGBM model parameters
The classifier is trained on debate data, so its internal parameters are fitted to maximize the reported precision.

axioms (1)

Domain assumption: contemporary LLMs share similar pretraining corpora, leading to strongly correlated errors.
Invoked in the abstract to explain why majority voting suppresses correct minorities.

invented entities (1)

Minority Truth (no independent evidence)
purpose: To name the phenomenon of correct minority opinions being suppressed by majority voting
New term coined in the paper based on experimental observations.
