Recognition: 2 theorem links
Lean theorem · LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs
Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3
The pith
Large language models do not consistently update probabilistic beliefs according to Bayesian rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs are not consistently Bayesian: some methods of incorporating evidence produce nearly Bayesian belief updates while others follow learned heuristics, and those non-Bayesian heuristics frequently outperform exact Bayesian computation on downstream tasks, indicating that the models' probabilistic world models are misspecified.
What carries the argument
The information processing gap, a quantitative measure of the inconsistency between an LLM's actual belief updates from evidence and the updates required by Bayesian probability.
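The review does not pin down the paper's exact Eq. (2); one plausible formalization, consistent with the Zellner identity quoted under the Lean theorem links below, measures the gap as a KL divergence between the directly prompted posterior and the Bayes posterior assembled from separately elicited pieces:

Δ(q) = D_KL(q_direct || q_Bayes), where q_Bayes(h | e) = π(h) ℓ(e | h) / Σ_h' π(h') ℓ(e | h'),

with π the elicited prior over hypotheses and ℓ the elicited likelihood of the evidence. Under this reading, Δ = 0 exactly when the one-step prompted update coincides with Bayes' rule.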
If this is right
- Some methods for incorporating evidence into LLMs achieve nearly Bayesian updates.
- Other methods use learned heuristics that deviate from Bayesian standards.
- Heuristic-based updates can produce higher downstream task performance than exact Bayesian computation.
- The information processing gap serves as a diagnostic for identifying issues in LLM-powered inferential systems.
Where Pith is reading between the lines
- Forcing stricter Bayesian consistency might reduce rather than improve practical performance in some applications.
- The gap could be applied to test consistency in other forms of reasoning beyond probability.
- Training data may encourage approximate heuristics over exact inference rules.
Load-bearing premise
That LLMs maintain stable internal probabilistic beliefs which can be reliably elicited and that the information processing gap captures genuine inconsistencies rather than prompting artifacts.
What would settle it
Repeating the same evidence incorporation experiment with varied but logically equivalent prompt phrasings or interfaces and finding that the measured gaps and downstream performance rankings remain unchanged.
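A minimal harness for that check, as a sketch only: measure_gap is a hypothetical wrapper around one full elicitation run, and the two stability criteria (small per-method spread, unchanged method ranking across phrasings) are one reasonable operationalization, not the paper's own.

    import numpy as np
    from scipy.stats import spearmanr

    def invariance_check(methods, phrasings, measure_gap):
        # gaps[i, j]: measured gap for method i under equivalent phrasing j
        gaps = np.array([[measure_gap(m, p) for p in phrasings] for m in methods])
        # per-method coefficient of variation across phrasings
        spread = gaps.std(axis=1) / (np.abs(gaps.mean(axis=1)) + 1e-12)
        # rank correlation of the method ordering between phrasing 0 and each other
        rank_stability = []
        for j in range(1, gaps.shape[1]):
            rho, _ = spearmanr(gaps[:, 0], gaps[:, j])
            rank_stability.append(rho)
        return spread, rank_stability

If the spreads are small and every rank correlation is near 1, the measured gaps are behaving like properties of the model rather than of the prompt.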
Original abstract
Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'information processing gap' to quantify inconsistencies in how LLMs update probabilistic beliefs upon receiving evidence. It evaluates multiple evidence-incorporation methods via prompting, finding that some produce nearly Bayesian updates while others rely on learned heuristics; surprisingly, the heuristic updates often outperform exact Bayesian computation on downstream tasks, which the authors interpret as evidence that LLMs' internal world models are misspecified. The work also positions the gap as a diagnostic tool for LLM-powered inferential systems.
Significance. If the prompted elicitations reliably reflect stable internal probabilistic representations rather than surface artifacts, the results would be significant for understanding and improving LLM reasoning under uncertainty in domains like medicine and science. The empirical comparison of update rules and the counterintuitive finding that non-Bayesian heuristics can outperform exact Bayes provide a concrete, falsifiable lens on model misspecification that goes beyond standard accuracy benchmarks.
major comments (2)
- [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a directly prompted posterior to a Bayesian posterior computed from a separately elicited prior and likelihood (sketched in code after this list). This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.
- [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.
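A minimal sketch of the load-bearing comparison under the reading above; elicit_prior, elicit_likelihood, and elicit_direct_posterior are hypothetical stand-ins for the paper's prompting interfaces, and the KL-based gap is an assumption rather than the paper's confirmed Eq. (2).

    import numpy as np

    def bayes_posterior(prior, likelihood):
        # prior: shape (H,); likelihood: P(evidence | h) for each hypothesis h
        unnorm = prior * likelihood
        return unnorm / unnorm.sum()

    def information_processing_gap(elicit_prior, elicit_likelihood,
                                   elicit_direct_posterior, eps=1e-12):
        # hypothetical elicitation calls, each returning a probability vector
        prior = elicit_prior()
        likelihood = elicit_likelihood()
        q = np.clip(elicit_direct_posterior(), eps, None)
        p = np.clip(bayes_posterior(prior, likelihood), eps, None)
        q, p = q / q.sum(), p / p.sum()
        # KL(direct || Bayes): zero iff the one-shot update matches Bayes' rule
        return float(np.sum(q * np.log(q / p)))

Any prompt sensitivity in the three elicitation calls propagates straight into the gap, which is exactly the referee's concern.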
minor comments (2)
- [Eq. (2)] Notation for the information processing gap (Eq. 2) could be clarified by explicitly stating whether the gap is an absolute difference, KL divergence, or another metric, and by providing the exact formula used for the Bayesian posterior computation.
- [Figure 3] Figure 3 (or equivalent) showing update consistency across methods would benefit from error bars or confidence intervals to allow readers to assess the reliability of the 'nearly Bayesian' classification.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the robustness and interpretability of our results on the information processing gap. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a directly prompted posterior to a Bayesian posterior computed from a separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.
Authors: We agree that systematic robustness checks are essential to ensure the information processing gap captures internal (in)consistencies rather than elicitation artifacts. While the original experiments used a consistent prompting protocol across models and tasks, we did not report exhaustive variations. In the revised manuscript, we will add a dedicated robustness subsection to §4 that includes: (i) multiple alternative phrasings for prior, likelihood, and posterior elicitation; (ii) variations in evidence presentation order; and (iii) results across temperatures from 0.0 to 1.0. We will quantify how the gap and downstream performance differences vary (or remain stable) under these conditions, with statistical summaries of sensitivity (a sketch of this sweep appears after the point-by-point responses). revision: yes
- Referee: [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.
Authors: We appreciate the request for greater transparency in the downstream evaluation. The Bayesian baseline throughout §4.3 is computed from the same elicited priors and likelihoods used for the prompted posteriors (not an oracle). To address the lack of breakdowns, the revision will expand §4.3 with per-task and per-scale tables reporting: mean performance differences, effect sizes (Cohen's d), and p-values from paired statistical tests. These will be disaggregated by task domain (e.g., medical, scientific) and model size, allowing clearer assessment of whether the heuristic advantage is consistent with misspecification. revision: yes
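The sweep promised in the first response is easy to make concrete; run_gap_experiment below is a hypothetical callable wrapping one full gap measurement, and the grid mirrors the three axes named there.

    from itertools import product

    PHRASINGS = ["v1", "v2", "v3"]                 # logically equivalent wordings
    ORDERS = ["original", "reversed", "shuffled"]  # evidence presentation order
    TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0]

    def robustness_sweep(run_gap_experiment):
        # one gap measurement per cell of the (phrasing, order, temperature) grid
        results = {}
        for phrasing, order, temp in product(PHRASINGS, ORDERS, TEMPERATURES):
            results[(phrasing, order, temp)] = run_gap_experiment(
                phrasing=phrasing, evidence_order=order, temperature=temp)
        return results

Summarizing the spread of results over this grid, per method, yields the 'statistical summaries of sensitivity' the response commits to.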
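For the tables promised in the second response, the standard recipe for a paired comparison (heuristic vs. Bayesian baseline scored on the same datapoints) is sketched below; the per-datapoint pairing is an assumption about how the evaluation is organized.

    import numpy as np
    from scipy.stats import ttest_rel

    def paired_comparison(heuristic_scores, bayes_scores):
        # Cohen's d for paired samples: mean difference / std of the differences
        diffs = np.asarray(heuristic_scores) - np.asarray(bayes_scores)
        d = diffs.mean() / diffs.std(ddof=1)
        t_stat, p_value = ttest_rel(heuristic_scores, bayes_scores)
        return d, t_stat, p_value

Disaggregating by task domain and model size then amounts to calling this once per (domain, size) cell.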
Circularity Check
Empirical evaluation of LLM belief updates with no self-referential derivations or fitted predictions
Full rationale
The paper frames its contribution as an empirical study introducing the 'information processing gap' to compare LLM evidence incorporation methods against exact Bayesian updates. No load-bearing mathematical derivations, parameter fits renamed as predictions, or self-citation chains appear in the abstract or described methodology. All central claims rest on experimental comparisons of prompted outputs versus computed baselines, which are externally falsifiable and do not reduce to the paper's own inputs by construction. This is the expected non-circular outcome for a purely evaluative work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs maintain internal probabilistic beliefs that can be accessed and compared to ideal Bayesian updates through prompting or other interfaces.
invented entities (1)
- information processing gap (no independent evidence)
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel · matches: "Δ(q) ≜ I_out − I_in = D_KL(q||p) ≥ 0 ... zero only when the post-data distribution is obtained via Bayes' theorem (Zellner, 1988)."
- Foundation.LogicAsFunctionalEquation · Translation Theorem / bilinear_family_forced · echoes: "Δ(q) = D_KL(q||π) − I_LER ... the amount we move our beliefs ... should comport with how strongly the evidence supports the hypothesis."
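For context, the identity both links lean on can be reconstructed as follows (a standard derivation of Zellner's 1988 result, not quoted from the paper). With prior π(θ), likelihood ℓ(x|θ), marginal m(x) = Σ_θ π(θ) ℓ(x|θ), and candidate post-data distribution q(θ):

Δ(q) = [Σ_θ q ln q + ln m(x)] − [Σ_θ q ln π + Σ_θ q ln ℓ(x|θ)]
     = Σ_θ q(θ) ln [ q(θ) / (π(θ) ℓ(x|θ) / m(x)) ]
     = D_KL(q || p_Bayes) ≥ 0,

with equality exactly when q(θ) = π(θ) ℓ(x|θ) / m(x), i.e. when the update is Bayes' rule; this is the sense in which Bayes' theorem is a 100%-efficient information processing rule.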