Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Tom Pecher

arXiv: 2605.30391 · v1 · pith:3FY5N5TQnew · submitted 2026-05-28 · 💻 cs.MA · cs.AI · cs.CL

Pith reviewed 2026-06-29 00:01 UTC · model grok-4.3

keywords: multi-agent debate · argumentative theory of reasoning · collective truth-seeking · epistemic diversity · LLM truth-seeking · hallucination measurement

The pith

When LLMs are engineered for epistemic diversity, their multi-agent adversarial debates improve truth-seeking on questionnaires even if individuals perform poorly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies the Argumentative Theory of Reasoning to large language models by setting up multi-agent debates that pit models against one another. It finds that this collective process yields higher accuracy on questionnaire tasks than any single model achieves alone. The gains appear tied to the adversarial structure rather than mere aggregation or prompting tricks. The work further uses debate dynamics to propose new ways of measuring model traits such as hallucination that static tests miss.

Core claim

Simulating ATR through LLM multi-agent debate shows that an epistemically diverse set of models produces significantly better truth-seeking results on questionnaires than isolated models, with the improvement mechanistically linked to adversarial debate principles, indicating collective reasoning can be advantageous beyond biological systems.

What carries the argument

LLM-MAD, the multi-agent debate protocol in which LLMs engage in adversarial discourse engineered for epistemic diversity to replicate ATR conditions.
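The round structure of such a debate can be sketched as follows. This is a minimal illustration only: the paper's actual prompts, critique format, and update rules are not shown on this page, so the agents here are hypothetical deterministic stand-ins for LLM calls, and "critique" is reduced to each agent seeing its peers' previous answers. The names `Agent`, `stubborn`, `persuadable`, and `run_debate` are assumptions introduced for the example.

```python
from collections import Counter
from typing import Callable, Dict, List

# An agent maps (question, own previous answer, peers' previous answers)
# to an answer. A real agent would wrap an LLM call; these are stand-ins.
Agent = Callable[[str, str, List[str]], str]


def stubborn(initial: str) -> Agent:
    """Hypothetical agent that never revises its answer."""
    def act(question: str, own: str, peers: List[str]) -> str:
        return own or initial
    return act


def persuadable(initial: str) -> Agent:
    """Hypothetical agent that adopts the most common peer answer."""
    def act(question: str, own: str, peers: List[str]) -> str:
        if not own:          # round 0: answer without seeing anyone else
            return initial
        if not peers:        # nobody to debate with: keep the answer
            return own
        return Counter(peers).most_common(1)[0][0]
    return act


def run_debate(agents: Dict[str, Agent], question: str,
               rounds: int) -> List[Dict[str, str]]:
    """Return every agent's answer after the initial QA and each round."""
    answers = {name: agent(question, "", []) for name, agent in agents.items()}
    history = [dict(answers)]
    for _ in range(rounds):
        revised = {}
        for name, agent in agents.items():
            # Each agent sees only the answers from the previous round.
            peers = [a for other, a in answers.items() if other != name]
            revised[name] = agent(question, answers[name], peers)
        answers = revised
        history.append(dict(answers))
    return history


if __name__ == "__main__":
    group = {"a": stubborn("Paris"), "b": persuadable("Lyon"),
             "c": stubborn("Paris")}
    for i, snapshot in enumerate(run_debate(group, "Capital of France?", 2)):
        print(i, snapshot)
```

Keeping the per-round history, rather than only the final answers, is what lets the dynamics over rounds be plotted, as the figure captions on this page describe.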

Load-bearing premise

Performance gains come specifically from ATR-style adversarial debate rather than model selection, prompting, or statistical aggregation.

What would settle it

A controlled experiment that holds model selection and prompting fixed while removing the adversarial debate format and still observes the same accuracy gains would falsify the central claim.
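One way to frame that control is to compare plain aggregation of the models' initial answers against aggregation of their post-debate answers, with the same models and prompts in both arms. The sketch below assumes answers are short strings scored by exact match and that the group answer is a majority vote; neither assumption comes from the paper, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Most frequent answer; ties are broken alphabetically for determinism."""
    counts = Counter(answers)
    top = max(counts.values())
    return sorted(a for a, c in counts.items() if c == top)[0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Exact-match accuracy over a non-empty question set."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def debate_gain(initial: List[List[str]], final: List[List[str]],
                gold: List[str]) -> float:
    """Voting accuracy after debate minus voting accuracy before debate.

    initial[q] and final[q] hold every agent's answer to question q
    before and after the debate rounds, respectively.
    """
    before = accuracy([majority_vote(a) for a in initial], gold)
    after = accuracy([majority_vote(a) for a in final], gold)
    return after - before


if __name__ == "__main__":
    initial = [["A", "B", "B"], ["C", "C", "D"]]
    final = [["A", "A", "A"], ["C", "C", "C"]]
    print(debate_gain(initial, final, gold=["A", "C"]))
```

A gain near zero would suggest that aggregation alone accounts for the group's performance; a clearly positive gain is what the central claim requires.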

Figures

Figures reproduced from arXiv: 2605.30391 by Tom Pecher.

**Figure 1.** Experiment 1 results. Experiment 1 with varying temperature shows that models begin to converge very slightly as the number of rounds increases (Figure 1a). Overall, we see that increasing temperature reduces score as expected, but also that the models slowly close this gap over time. Performance appears to peak around T = 0.5, which aligns with the value of T = 0.7 that is generally considered the industry…

**Figure 2.** Experiment 2 results. Introducing sham critique into the debate drastically alters the observed dynamics for both tests. In the temperature experiment (Figure 2a), we see that performance stays constant for all models. Most notably, the same improvement observed in higher-temperature models (T = 1.5) is no longer observed outside of a very minor positive trend. The low variance of each model’s results shows…

**Figure 3.** Experiment 3 results. The solo debate ablation yielded some of the most interesting results in this experimental phase, as the two tests showed very different dynamics. The temperature results (Figure 3a) are surprisingly similar to the baseline. The same trend in improvement for T = 1.5 can clearly be observed, although this improvement is visibly slower and steadier than the baseline results. It is theref…

**Figure 4.** Experiment 4 results. Little can be inferred from the dynamics of the no-revisions temperature experiment (Figure 4a). The results are vaguely similar to baseline, with the main difference being an overall increase in individual model variance. Some slight improvement can be observed in the high-temperature models, however this is notably more volatile and is not consistently sustained. These fluctuations a…

**Figure 5.** Experiment 7 results. Results from Experiment 7 show a complete absence of any truth-seeking dynamics: for all agent sizes, the group’s performance remains largely unchanged as the number of rounds increases. This clearly shows how limited homogeneous LLM-MAD truly is, revealing nothing that could not already be inferred from the initial QA. If anything, the agents’ performance actually degrades over time (…

**Figure 6.** Experiment 8 results. There is much that can be learnt from the results of Experiment 8. Whilst the overall performances across datasets vary significantly (which is to be expected), we see that the agents perform similarly relative to each other. This allows us to intuitively rank the models’ behaviour visually and form a solid foundation for potential numerical benchmarks. Again, results such as those in …

**Figure 7.** Experiment 9 results.

**Figure 8.** Experiment 10 results. As we have seen in Experiment 9, adding many models of similar size will shift the group’s performance in one direction. However, upon adding agents of roughly symmetrical size (each small model is paired with a large one), these effects appear to cancel out, resulting in minimal change in the overall group performance. Observing the results in…

**Figure 9.** Results of Proposed Benchmarking Methodology.

Original abstract

Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims to be the first simulation of ATR via LLM multi-agent debate with performance gains from epistemic diversity, but supplies no methods, data, or controls to support any of it.

The main thing to know is that the abstract asserts LLM multi-agent debate improves truth-seeking on questionnaires when models are made epistemically diverse, and that this is mechanistically due to ATR principles rather than other factors. It also proposes using debate dynamics as a new way to benchmark things like hallucination. Without the full paper's methods or results, none of that can be checked.

What is new is the explicit framing of LLM debate as a test of ATR and the suggestion that collective setups might be generally better than individual ones for truth-seeking. The connection to the cognitive science literature on argumentative reasoning is a reasonable move and gives the work a clear conceptual anchor.

The soft spot is the lack of experimental detail. The abstract promises rigorous analysis and strong evidence for mechanistic grounding, but gives no model specs, controls for prompting or aggregation effects, diversity metrics, statistical tests, or ablations. This leaves the central claim about ATR-style mechanics untestable and risks circularity if diversity is tuned to the outcome. The benchmarking proposal is interesting in principle but rests on the same unshown results.

This is for people already working on multi-agent LLM systems who want to explore social-reasoning angles. A reader wanting reproducible findings or clear evidence on the mechanism will get little value until the experiments are presented.

I would not recommend sending this to peer review in its current form. It needs the methods, results, and verification of the ATR link before it is ready for referees.

Referee Report

2 major / 0 minor

Summary. The manuscript claims to be the first simulation of the Argumentative Theory of Reasoning (ATR) via multi-agent debate (MAD) among large language models. It asserts that engineering an epistemically diverse set of LLMs enables LLM-MAD to significantly improve truth-seeking performance on questionnaire-based tasks even when individual models show limited standalone performance, that this gain is mechanistically grounded in ATR principles, and that debate dynamics can be used to propose a novel benchmarking methodology for measuring intrinsic model properties such as hallucination propensity.

Significance. If the empirical claims were supported by detailed methods, controls, ablations, and statistical evidence, the work would offer notable significance for multi-agent systems research by linking cognitive science theories to LLM collective reasoning and by suggesting new evaluation paradigms beyond static benchmarks. It could inform designs for improving truthfulness in AI through adversarial mechanisms.

major comments (2)

[Abstract] The abstract claims 'rigorous empirical analysis' and 'strong empirical evidence' that LLM-MAD improves truth-seeking and that gains are mechanistically grounded in ATR, but supplies no experimental details, model specifications, tasks, diversity metrics, controls, ablations, or statistical tests. Without these, the central empirical claims cannot be evaluated.

[Abstract] The assertion that performance gains are 'mechanistically grounded in the central principles of ATR' risks circularity, as no independent verification of the mechanism (e.g., explicit metrics for epistemic diversity or debate dynamics separate from the performance outcome) is described; diversity could be defined or selected post hoc to match observed results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the abstract. We address each point below and clarify how the manuscript supports its claims with details provided in the full text.

Referee: [Abstract] The abstract claims 'rigorous empirical analysis' and 'strong empirical evidence' that LLM-MAD improves truth-seeking and that gains are mechanistically grounded in ATR, but supplies no experimental details, model specifications, tasks, diversity metrics, controls, ablations, or statistical tests. Without these, the central empirical claims cannot be evaluated.

Authors: The abstract is intentionally concise as a high-level summary of contributions. The full manuscript details the experimental setup in dedicated Methods and Results sections, including specific LLMs used, questionnaire tasks, epistemic diversity metrics (defined via pre-debate disagreement and knowledge variance), controls, ablations, and statistical tests. These elements are reported with full transparency to allow evaluation of the claims.

revision: partial

Referee: [Abstract] The assertion that performance gains are 'mechanistically grounded in the central principles of ATR' risks circularity, as no independent verification of the mechanism (e.g., explicit metrics for epistemic diversity or debate dynamics separate from the performance outcome) is described; diversity could be defined or selected post hoc to match observed results.

Authors: Epistemic diversity is operationalized independently via pre-defined metrics on model disagreement patterns and knowledge base differences measured before any debate occurs. Debate dynamics (argument exchange, convergence rates) are tracked and analyzed separately from final accuracy outcomes in dedicated analysis sections. This separation provides independent verification that observed gains align with ATR mechanisms rather than post hoc fitting.

revision: no
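A pre-debate disagreement metric of the kind the simulated rebuttal mentions could, for illustration, be computed as the average pairwise disagreement rate over the models' initial answers. The page does not define the metric, so this is only one plausible reading; the function name and the exact-match comparison are assumptions. Because it uses only answers collected before any debate and never the gold labels, it is independent of the accuracy outcome.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_disagreement(answers: Dict[str, List[str]]) -> float:
    """Mean fraction of questions on which two models answer differently.

    answers maps a model name to its pre-debate answers, one per question,
    in the same question order for every model. The result lies in [0, 1]:
    0 means all models agree everywhere, 1 means every pair always differs.
    """
    names = sorted(answers)
    if len(names) < 2:
        raise ValueError("need at least two models")
    n_questions = len(answers[names[0]])
    if n_questions == 0 or any(len(answers[m]) != n_questions for m in names):
        raise ValueError("every model needs the same non-empty question set")

    rates = []
    for first, second in combinations(names, 2):
        differing = sum(x != y for x, y in zip(answers[first], answers[second]))
        rates.append(differing / n_questions)
    return sum(rates) / len(rates)


if __name__ == "__main__":
    initial = {
        "m1": ["A", "B", "C", "D"],
        "m2": ["A", "B", "C", "A"],
        "m3": ["A", "A", "C", "D"],
    }
    print(round(pairwise_disagreement(initial), 3))
```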

Circularity Check

0 steps flagged

No significant circularity identified

The provided abstract and context contain no derivation chain, equations, fitted parameters called predictions, or self-citations that reduce any claimed result to its inputs by construction. Claims rest on empirical demonstration of performance gains from engineered epistemic diversity in LLM-MAD, with no quoted reduction (e.g., no self-definitional loop or ansatz smuggled via prior work) visible in the text. The full manuscript is referenced but not supplied, precluding identification of any load-bearing circular step under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities beyond the background reliance on ATR; full paper required for ledger population.

axioms (1)

domain assumption: The Argumentative Theory of Reasoning accurately describes human truth-seeking.
The paper takes ATR as the explanatory framework for why collective debate improves performance.

[1] [1]

Why do humans reason? arguments for an argumentative theory

Hugo Mercier and Dan Sperber. Why do humans reason? arguments for an argumentative theory. Behavioral and brain sciences, 34(2):57–74, 2011

2011

[2] [2]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models.CoRR, abs/2201.11903, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Simple and scalable predictive uncertainty estimation using deep ensembles, 2017

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles, 2017

2017

[4] [4]

Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38, March 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38, March 2023

2023

[5] [5]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022

2022

[6] [6]

Halueval: A large-scale hallucination evaluation benchmark for large language models

Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 6449–6464, 2023

2023

[7] [7]

Gaming truthfulqa: Simple heuristics exposed - dataset weaknesses

Alex Turner and Mark Kurzeja. Gaming truthfulqa: Simple heuristics exposed - dataset weaknesses. https://turntrout.com/original-truthfulqa-weaknesses, 2025. Accessed: 2025-11-12

2025

[8] [8]

Reasoning, biases and dual processes: The lasting impact of wason (1960)

Jonathan St BT Evans. Reasoning, biases and dual processes: The lasting impact of wason (1960). Quarterly journal of experimental psychology, 69(10):2076–2092, 2016

1960

[9] [9]

Stop fooling yourself! (diagnosing and treating confirmation bias).ENeuro, 11(10), 2024

Richard T Born. Stop fooling yourself! (diagnosing and treating confirmation bias).ENeuro, 11(10), 2024

2024

[10] [10]

Harvard university press, 2017

Hugo Mercier and Dan Sperber.The enigma of reason. Harvard university press, 2017

2017

[11] Emmanuel Trouche, Petter Johansson, Lars Hall, and Hugo Mercier. The selective laziness of reasoning. Cognitive Science, 40(8):2122–2136, 2016.

[12] Peter C Wason. Reasoning about a rule. Quarterly Journal of Experimental Psychology, 20(3):273–281, 1968.

[13] David Moshman and Molly Geil. Collaborative reasoning: Evidence for collective rationality. Thinking & Reasoning, 4(3):231–248, 1998.

[14] Keith E. Stanovich and Richard F. West. Individual differences in reasoning: Implications for the rationality debate? Behavioral and Brain Sciences, 23(5):645–665, 2000.

[15] Michael Stevens. The future of reasoning. YouTube, April 2021. Available at: https://youtu.be/ ˙ArVh3Cj9rw (Accessed on 08/12/2025).

[16] David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526, 1978.

[17] Dan Sperber. Metarepresentations in an evolutionary perspective. Metarepresentations: A Multidisciplinary Perspective, 10:117–137, 2000.

[18] Dan Sperber, Fabrice Clément, Christophe Heintz, Olivier Mascaro, Hugo Mercier, Gloria Origgi, and Deirdre Wilson. Epistemic vigilance. Mind & Language, 25(4):359–393, 2010.

[19] Fabian Seitz. Argumentation evolved: But how? Coevolution of coordinated group behavior and reasoning. Argumentation, 34(2):237–260, 2020.

[20] Phan Minh Dung. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence, 77(2):321–357, 1995.

[21] Sanjay Modgil and Henry Prakken. The ASPIC+ framework for structured argumentation: A tutorial. Argument & Computation, 5(1):31–62, 2014.

[22] Andrei Bondarenko, Francesca Toni, and Robert A Kowalski. An assumption-based framework for non-monotonic reasoning. In LPNMR, volume 93, pages 171–189, 1993.

[23] Lihu Chen, Xiang Yin, and Francesca Toni. Latent debate: A surrogate framework for interpreting LLM thinking, 2026.

[24] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

[25] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate, 2018.

[26] Iyad Rahwan and Kate Larson. Mechanism design for abstract argumentation. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1031–1038. International Foundation for Autonomous Agents and Multiagent Systems, 2008.

[27] Shengxin Hong, Liang Xiao, Xin Zhang, and Jianxia Chen. ArgMed-Agents: Explainable clinical decision reasoning with LLM disscusion via argumentation schemes. In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 5486–5493. IEEE, 2024.

[28] Huao Li, Yu Chong, Simon Stepputtis, Joseph P Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 180–192, 2023.

[29] Binwei Yao, Chao Shang, Wanyu Du, Jianfeng He, Ruixue Lian, Yi Zhang, Hang Su, Sandesh Swamy, and Yanjun Qi. Peacemaker or troublemaker: How sycophancy shapes multi-agent debate, 2025.

[30] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024.

[31] Yan Zhou and Yanguang Chen. Adaptive heterogeneous multi-agent debate for enhanced educational and factual reasoning in large language models. Journal of King Saud University Computer and Information Sciences, 37(10):330, 2025.

[32] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17889–17904, 2024.

[33] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate, 2023.

[34] Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, and Arnu Pretorius. Should we be going MAD? A look at multi-agent debate strategies for LLMs, 2024.

[35] Ray Li, Tanishka Bagade, Kevin Martinez, Flora Yasmin, Grant Ayala, Michael Lam, and Kevin Zhu. A debate-driven experiment on LLM hallucinations and accuracy, 2024.

[36] Zheng Lin, Zhenxing Niu, Zhibin Wang, and Yinghui Xu. Interpreting and mitigating hallucination in MLLMs through multi-agent debate, 2024.

[37] Qile He and Siting Le. Enhancing hallucination detection in large language models through a dual-position debate multi-agent framework. International Conference on Intelligent Computing, 2025.

[38] Yilin Bai. ConfidenceCal: Enhancing LLMs reliability through confidence calibration in multi-agent debate. In 2024 10th International Conference on Big Data and Information Analytics (BigDIA), pages 221–226. IEEE, 2024.

[39] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.

[40] Zengqing Wu and Takayuki Ito. The hidden strength of disagreement: Unraveling the consensus-diversity tradeoff in adaptive multi-agent systems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15288–15308, 2025.

[41] Yuhan Liu, Yuxuan Liu, Xiaoqing Zhang, Xiuying Chen, and Rui Yan. The truth becomes clearer through debate! Multi-agent systems with large language models unmask fake news. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 504–514, 2025.

[42] Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33566–33574, 2026.

[43] Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, and Eugene Ie. Improving multi-agent debate with sparse communication topology. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7281–7294, 2024.

[44] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.

[45] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence?, 2019.

[46] LXT.ai. LLM benchmarks: The saturation and contamination problem, 2024. Available at: https://www.lxt.ai/blog/llm-benchmarks/ (Accessed on 1/5/2026).

[47] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-limited LLM benchmark, 2025.

[48] Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. Why language models hallucinate, 2025.

[49] Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip E. Tetlock. ForecastBench: A dynamic benchmark of AI forecasting capabilities, 2025.

[50] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.

[51] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline, 2024.

Appendices

Appendix I: Prompt Templates

Initial QA Prompt Example

THE FOLLOWING IS A LOG OF QUESTIONS YOU HAVE ANSWERED, CRITIQUES OF YOUR ANSWER...