
arxiv: 2605.30391 · v1 · pith:3FY5N5TQnew · submitted 2026-05-28 · 💻 cs.MA · cs.AI · cs.CL

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Pith reviewed 2026-06-29 00:01 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CL
keywords multi-agent debate · argumentative theory of reasoning · collective truth-seeking · epistemic diversity · LLM truth-seeking · hallucination measurement

The pith

When LLMs are engineered for epistemic diversity, their multi-agent adversarial debates improve truth-seeking on questionnaires even if individuals perform poorly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies the Argumentative Theory of Reasoning to large language models by setting up multi-agent debates that pit models against one another. It finds that this collective process yields higher accuracy on questionnaire tasks than any single model achieves alone. The gains appear tied to the adversarial structure rather than mere aggregation or prompting tricks. The work further uses debate dynamics to propose new ways of measuring model traits such as hallucination that static tests miss.

Core claim

Simulating ATR through LLM multi-agent debate shows that an epistemically diverse set of models produces significantly better truth-seeking results on questionnaires than isolated models, with the improvement mechanistically linked to adversarial debate principles, indicating collective reasoning can be advantageous beyond biological systems.

What carries the argument

LLM-MAD, the multi-agent debate protocol in which LLMs engage in adversarial discourse engineered for epistemic diversity to replicate ATR conditions.
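
A debate protocol of this kind can be pictured as a loop in which every agent answers, sees its peers' latest answers, and then revises. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the `Agent` signature, `run_debate`, and the toy agents `stubborn` and `conformist` are hypothetical names, and real participants would be LLM calls that critique peers' answers rather than the fixed rules used here.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical agent interface: (question, peer_answers) -> answer.
# In a real system this would wrap an LLM call; here it is a plain function.
Agent = Callable[[str, List[str]], str]


def run_debate(question: str, agents: Dict[str, Agent], rounds: int = 3) -> Dict[str, List[str]]:
    """Return each agent's answer history: one initial answer plus `rounds` revisions."""
    # Initial round: every agent answers without seeing anyone else.
    history = {name: [agent(question, [])] for name, agent in agents.items()}
    for _ in range(rounds):
        # Snapshot the latest answers so all agents revise simultaneously.
        latest = {name: answers[-1] for name, answers in history.items()}
        for name, agent in agents.items():
            peers = [ans for other, ans in latest.items() if other != name]
            history[name].append(agent(question, peers))
    return history


def stubborn(answer: str) -> Agent:
    """Toy agent that never changes its answer."""
    return lambda question, peers: answer


def conformist(initial: str) -> Agent:
    """Toy agent that adopts the most common peer answer once it sees any."""
    def agent(question: str, peers: List[str]) -> str:
        if not peers:
            return initial
        return Counter(peers).most_common(1)[0][0]
    return agent


if __name__ == "__main__":
    toy_agents = {"a": stubborn("Paris"), "b": stubborn("Paris"), "c": conformist("Lyon")}
    for name, answers in run_debate("Capital of France?", toy_agents, rounds=2).items():
        print(name, answers)
```

Keeping the full answer history per agent, rather than only the final answer, is what makes round-by-round dynamics such as convergence observable.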

Load-bearing premise

Performance gains come specifically from ATR-style adversarial debate rather than model selection, prompting, or statistical aggregation.

What would settle it

A controlled experiment that holds model selection and prompting fixed while removing the adversarial debate format and still observes the same accuracy gains would falsify the central claim.
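
One way to picture such a control is to compare a plain majority vote over the models' initial answers (aggregation only) with a majority vote over their post-debate answers, with models and prompts held fixed. The sketch below is illustrative only: `majority_vote`, `accuracy`, and `aggregation_vs_debate` are hypothetical helpers, the data layout (one list of answers per question) is an assumption, and the text above does not say how the paper itself implements its controls.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def aggregation_vs_debate(
    initial: List[List[str]],  # per question: each model's pre-debate answer
    final: List[List[str]],    # per question: each model's post-debate answer
    gold: List[str],
) -> Dict[str, float]:
    """Compare voting on pre-debate answers with voting on post-debate answers."""
    vote_only = accuracy([majority_vote(a) for a in initial], gold)
    vote_after_debate = accuracy([majority_vote(a) for a in final], gold)
    return {
        "vote_only": vote_only,
        "vote_after_debate": vote_after_debate,
        "gain": vote_after_debate - vote_only,
    }
```

If `gain` stays near zero across tasks, aggregation alone would account for the improvement; a consistently positive `gain` is what the adversarial-debate explanation predicts.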

Figures

Figures reproduced from arXiv: 2605.30391 by Tom Pecher.

Figure 1: Experiment 1 results. Experiment 1 with varying temperature shows that models begin to converge very slightly as the number of rounds increases (Figure 1a). Overall, we see that increasing temperature reduces score as expected, but also that the models slowly close this gap over time. Performance appears to peak around T = 0.5, which aligns with the value of T = 0.7 that is generally considered the industry…
Figure 2: Experiment 2 results. Introducing sham critique into the debate drastically alters the observed dynamics for both tests. In the temperature experiment (Figure 2a), we see that performance stays constant for all models. Most notably, the same improvement observed in higher-temperature models (T = 1.5) is no longer observed outside of a very minor positive trend. The low variance of each model’s results shows…
Figure 3: Experiment 3 results. The solo debate ablation yielded some of the most interesting results in this experimental phase, as the two tests showed very different dynamics. The temperature results (Figure 3a) are surprisingly similar to the baseline. The same trend in improvement for T = 1.5 can clearly be observed, although this improvement is visibly slower and steadier than the baseline results. It is theref…
Figure 4: Experiment 4 results. Little can be inferred from the dynamics of the no-revisions temperature experiment (Figure 4a). The results are vaguely similar to baseline, with the main difference being an overall increase in individual model variance. Some slight improvement can be observed in the high-temperature models, however this is notably more volatile and is not consistently sustained. These fluctuations a…
Figure 5: Experiment 7 results. Results from Experiment 7 show a complete absence of any truth-seeking dynamics: for all agent sizes, the group’s performance remains largely unchanged as the number of rounds increases. This clearly shows how limited homogeneous LLM-MAD truly is, revealing nothing that could not already be inferred from the initial QA. If anything, the agents’ performance actually degrades over time (…
Figure 6: Experiment 8 results. There is much that can be learnt from the results of Experiment 8. Whilst the overall performances across datasets vary significantly (which is to be expected), we see that the agents perform similarly relative to each other. This allows us to intuitively rank the models’ behaviour visually and form a solid foundation for potential numerical benchmarks. Again, results such as those in …
Figure 7: Experiment 9 results.
Figure 8: Experiment 10 results. As we have seen in Experiment 9, adding many models of similar size will shift the group’s performance in one direction. However, upon adding agents of roughly symmetrical size (each small model is paired with a large one), these effects appear to cancel out, resulting in minimal change in the overall group performance. Observing the results in…
Figure 9: Results of Proposed Benchmarking Methodology.
Original abstract

Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as the primary vehicle for truth-seeking, ATR reconceptualises truth as an emergent property of social epistemology: the product of imperfect individual reasoning refined under the adversarial pressure of debate. This distributed method of collective intelligence has guided humanity to ever-greater epistemic heights and underpins the foundational principles of all democratic systems. This thesis breaks new ground by, for the first time, simulating ATR through the multi-agent debate (MAD) of large language models (LLMs). With rigorous empirical analysis, we demonstrate that, when correctly engineering an epistemically diverse set of models, LLM-MAD can significantly improve truth-seeking performance on questionnaire-based tasks, even when individual debate participants exhibit limited standalone performance. Furthermore, we present strong empirical evidence that this performance gain is mechanistically grounded in the central principles of ATR, suggesting that collective reasoning may be universally favourable over individualist reasoning, rather than a quirk in biology or evolution. Finally, drawing on our analysis of debate dynamics, we propose a novel benchmarking methodology that leverages LLM-MAD to measure intrinsic model properties (such as hallucination propensity) in order to compare models in ways that current static benchmarking approaches cannot support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims to be the first simulation of the Argumentative Theory of Reasoning (ATR) via multi-agent debate (MAD) among large language models. It asserts that engineering an epistemically diverse set of LLMs enables LLM-MAD to significantly improve truth-seeking performance on questionnaire-based tasks even when individual models show limited standalone performance, that this gain is mechanistically grounded in ATR principles, and that debate dynamics can be used to propose a novel benchmarking methodology for measuring intrinsic model properties such as hallucination propensity.

Significance. If the empirical claims were supported by detailed methods, controls, ablations, and statistical evidence, the work would offer notable significance for multi-agent systems research by linking cognitive science theories to LLM collective reasoning and by suggesting new evaluation paradigms beyond static benchmarks. It could inform designs for improving truthfulness in AI through adversarial mechanisms.

major comments (2)
  1. [Abstract] The abstract claims 'rigorous empirical analysis' and 'strong empirical evidence' that LLM-MAD improves truth-seeking and that the gains are mechanistically grounded in ATR, but it supplies no experimental details, model specifications, tasks, diversity metrics, controls, ablations, or statistical tests. Without these, the central empirical claims cannot be evaluated.
  2. [Abstract] The assertion that performance gains are 'mechanistically grounded in the central principles of ATR' risks circularity: no independent verification of the mechanism is described (e.g., explicit metrics for epistemic diversity or debate dynamics that are separate from the performance outcome), so diversity could be defined or selected post hoc to match the observed results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the abstract. We address each point below and clarify how the manuscript supports its claims with details provided in the full text.

Point-by-point responses
  1. Referee: [Abstract] The abstract claims 'rigorous empirical analysis' and 'strong empirical evidence' that LLM-MAD improves truth-seeking and that the gains are mechanistically grounded in ATR, but it supplies no experimental details, model specifications, tasks, diversity metrics, controls, ablations, or statistical tests. Without these, the central empirical claims cannot be evaluated.

    Authors: The abstract is intentionally concise as a high-level summary of contributions. The full manuscript details the experimental setup in dedicated Methods and Results sections, including the specific LLMs used, the questionnaire tasks, epistemic diversity metrics (defined via pre-debate disagreement and knowledge variance), controls, ablations, and statistical tests. These elements are reported in full so that the claims can be evaluated.
    Revision: partial

  2. Referee: [Abstract] The assertion that performance gains are 'mechanistically grounded in the central principles of ATR' risks circularity: no independent verification of the mechanism is described (e.g., explicit metrics for epistemic diversity or debate dynamics that are separate from the performance outcome), so diversity could be defined or selected post hoc to match the observed results.

    Authors: Epistemic diversity is operationalized independently, via predefined metrics on model disagreement patterns and knowledge-base differences that are measured before any debate occurs. Debate dynamics (argument exchange, convergence rates) are tracked and analyzed separately from final accuracy outcomes in dedicated analysis sections. This separation provides independent verification that the observed gains align with ATR mechanisms rather than post hoc fitting.
    Revision: no
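
The rebuttal names pre-debate disagreement as a diversity metric but gives no formula. As an illustration only, the sketch below computes one plausible reading of it: the mean pairwise disagreement rate between models' initial answers on a shared question set. The function name `pairwise_disagreement` and the input layout are assumptions, and the manuscript's actual definition may differ.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_disagreement(answers_by_model: Dict[str, List[str]]) -> float:
    """Mean fraction of questions on which two models give different answers,
    averaged over all model pairs. Computed on pre-debate answers only,
    so it does not depend on any post-debate accuracy figure."""
    models = list(answers_by_model)
    lengths = {len(answers_by_model[m]) for m in models}
    if len(lengths) > 1:
        raise ValueError("all models must answer the same number of questions")
    pairs = list(combinations(models, 2))
    n_questions = lengths.pop() if lengths else 0
    if not pairs or n_questions == 0:
        return 0.0
    total = 0.0
    for a, b in pairs:
        differing = sum(x != y for x, y in zip(answers_by_model[a], answers_by_model[b]))
        total += differing / n_questions
    return total / len(pairs)
```

A value of 0.0 means the models agree on every question before debating, and 1.0 means every pair disagrees on every question. Because the metric uses only pre-debate answers, it can be fixed before any accuracy outcome is observed.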

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The provided abstract and context contain no derivation chain, equations, fitted parameters called predictions, or self-citations that reduce any claimed result to its inputs by construction. Claims rest on empirical demonstration of performance gains from engineered epistemic diversity in LLM-MAD, with no quoted reduction (e.g., no self-definitional loop or ansatz smuggled via prior work) visible in the text. The full manuscript is referenced but not supplied, precluding identification of any load-bearing circular step under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities beyond the background reliance on ATR; full paper required for ledger population.

axioms (1)
  • Domain assumption: the Argumentative Theory of Reasoning accurately describes human truth-seeking.
    The paper takes ATR as the explanatory framework for why collective debate improves performance.

pith-pipeline@v0.9.1-grok · 5777 in / 1233 out tokens · 51250 ms · 2026-06-29T00:01:04.275580+00:00 · methodology

