LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs
Pith reviewed 2026-06-30 23:03 UTC · model grok-4.3
The pith
LLMs update beliefs through a mix of near-Bayesian rules and learned heuristics, with the heuristics often outperforming exact Bayesian computation on tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs do not update their probabilistic beliefs consistently according to Bayesian principles. Depending on how evidence is incorporated, the updates are either nearly Bayesian or follow a learned heuristic. The heuristic updates often achieve higher downstream task performance than exact Bayesian computation, which suggests that the models' probabilistic representations are misspecified.
What carries the argument
The information processing gap, which compares outputs from multiple evidence incorporation methods to expose inconsistencies in how LLMs revise probabilistic beliefs.
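To make the comparison concrete, here is a minimal sketch in Python. It is not the paper's metric: it assumes a finite hypothesis set and uses the KL divergence between a directly elicited posterior and the exact Bayes posterior computed from the model's own stated prior and likelihood. All numbers and function names are hypothetical.

```python
import math

def bayes_posterior(prior, likelihood):
    """Exact Bayes update over a finite hypothesis set."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # marginal likelihood of the observation
    return [u / evidence for u in unnormalized]

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) in nats; eps guards against log(0)."""
    return sum(
        qi * math.log((qi + eps) / (pi + eps))
        for qi, pi in zip(q, p)
        if qi > 0
    )

# Hypothetical values standing in for quantities elicited from a model.
stated_prior = [0.5, 0.3, 0.2]        # model's stated belief before evidence
stated_likelihood = [0.1, 0.6, 0.3]   # model's stated P(evidence | hypothesis)
elicited_posterior = [0.2, 0.6, 0.2]  # model's directly stated updated belief

reference = bayes_posterior(stated_prior, stated_likelihood)
gap = kl_divergence(elicited_posterior, reference)
print(f"Bayes posterior: {[round(r, 3) for r in reference]}")
print(f"Illustrative gap (KL, nats): {gap:.4f}")
```

A gap near zero would indicate that the direct update agrees with Bayes' rule applied to the model's own inputs; a larger value flags an inconsistency between the two incorporation routes.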
If this is right
- Certain evidence incorporation methods yield belief updates that closely match Bayesian rules.
- Heuristic updates that deviate from Bayes can still improve performance on tasks that require probabilistic reasoning.
- The information processing gap provides a practical diagnostic for detecting problems in systems that rely on LLMs for inference under uncertainty.
Where Pith is reading between the lines
- Training procedures that reward consistency with Bayesian updating might reduce the performance gap between heuristic and exact methods.
- The same inconsistency patterns could appear in other domains that require sequential belief revision, such as medical diagnosis or legal reasoning.
- Developers could use the gap measure to select or fine-tune models for applications where calibrated uncertainty matters.
Load-bearing premise
The tested evidence incorporation methods reflect the LLMs' actual internal updating processes and are not mainly shaped by prompt format or training data.
What would settle it
A controlled test in which every evidence incorporation method produces identical belief updates that exactly match the results of Bayesian calculation on the same inputs across several models and tasks.
Original abstract
Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'information processing gap' technique for studying LLMs as information processing rules to quantify internal (in)consistencies in probabilistic belief updates from evidence. Experiments test multiple evidence incorporation approaches, identifying some as nearly Bayesian and others as learned heuristics; non-Bayesian heuristics are reported to often outperform exact Bayesian computation on downstream tasks, taken as evidence of misspecification in LLMs' world models. The measure is also positioned as a diagnostic for issues in LLM-powered inferential systems.
Significance. If the experimental controls and results hold after addressing elicitation confounds, the work would be significant for understanding limitations of LLMs in uncertain reasoning and for providing practical diagnostics in high-stakes applications. The reported outperformance of heuristics over exact Bayesian updates is a noteworthy observation that, if robust, would indicate systematic misspecification rather than mere inconsistency.
Major comments (2)
- [Abstract] The claim that non-Bayesian heuristic updates often outperform exact Bayesian computation on downstream tasks is presented without any reported datasets, sample sizes, statistical tests, or controls. This absence directly undermines evaluation of whether the data support the central claims of internal inconsistency and world-model misspecification.
- [Methods / Experimental Setup] The information processing gap is used to distinguish nearly Bayesian from heuristic updates, yet no controls are described for prompt-format invariance (e.g., verbalization of priors/likelihoods, chain-of-thought instructions, or output format). Without such controls, differences attributed to internal representations may instead reflect elicitation artifacts, which leaves the Bayesian-vs-heuristic classification and the downstream-performance comparison load-bearing but insecure.
minor comments (1)
- [Introduction / §2] Define the information processing gap with an explicit equation or pseudocode in the main text (rather than deferring to an appendix) to improve clarity of the core metric.
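One form such an equation might take is sketched below. It assumes the gap is built on Zellner's (1988) criterion for optimal information processing; the paper's own definition is not reproduced on this page and may differ. The symbols are assumptions introduced here: $\pi(\theta)$ is the prior, $f(y \mid \theta)$ the likelihood, $h(y)$ the marginal likelihood, and $g(\theta \mid y)$ the posterior reported by the model.

```latex
\[
\Delta[g]
  = \int g(\theta \mid y)\,
    \ln \frac{g(\theta \mid y)\, h(y)}{\pi(\theta)\, f(y \mid \theta)}\, d\theta
  = \mathrm{KL}\!\left( g \,\middle\|\, \frac{\pi f}{h} \right) \;\ge\; 0,
\qquad
h(y) = \int \pi(\theta)\, f(y \mid \theta)\, d\theta .
\]
```

Under this form, $\Delta[g]$ equals zero exactly when $g$ is the Bayes posterior, so any positive value measures a departure from Bayesian updating.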
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of presentation and experimental robustness. We address each major comment below.
- Referee: [Abstract] The claim that non-Bayesian heuristic updates often outperform exact Bayesian computation on downstream tasks is presented without any reported datasets, sample sizes, statistical tests, or controls. This absence directly undermines evaluation of whether the data support the central claims of internal inconsistency and world-model misspecification.
  Authors: The abstract summarizes a finding whose supporting details (datasets, sample sizes, and statistical comparisons) appear in the main text. To make the claim easier to evaluate at a glance, we will revise the abstract to include a brief qualifier referencing the experimental basis.
  Revision: yes
- Referee: [Methods / Experimental Setup] The information processing gap is used to distinguish nearly Bayesian from heuristic updates, yet no controls are described for prompt-format invariance (e.g., verbalization of priors/likelihoods, chain-of-thought instructions, or output format). Without such controls, differences attributed to internal representations may instead reflect elicitation artifacts, which leaves the Bayesian-vs-heuristic classification and the downstream-performance comparison load-bearing but insecure.
  Authors: This concern about elicitation artifacts is well-taken. Our experiments used fixed prompt templates, but we did not report systematic invariance tests. We will expand the methods section to document prompt standardization and add robustness analyses testing sensitivity to verbalization and output format.
  Revision: yes
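A prompt-format robustness analysis of the kind promised above could be sketched as follows. This is an illustration only, not the authors' procedure: it assumes the same question is posed under several prompt formats, and it summarizes sensitivity as the largest pairwise total variation distance between the elicited posteriors. The format names and probabilities are hypothetical.

```python
from itertools import combinations

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_pairwise_tv(posteriors_by_format):
    """Largest disagreement between any two prompt formats (0.0 if fewer than two)."""
    distributions = list(posteriors_by_format.values())
    return max(
        (total_variation(p, q) for p, q in combinations(distributions, 2)),
        default=0.0,
    )

# Hypothetical posteriors elicited for one question under three prompt formats.
elicited = {
    "verbal_probabilities": [0.20, 0.60, 0.20],
    "numeric_table": [0.18, 0.62, 0.20],
    "chain_of_thought": [0.30, 0.50, 0.20],
}

spread = max_pairwise_tv(elicited)
print(f"Max pairwise TV across prompt formats: {spread:.2f}")
```

If the spread across formats is comparable to the differences between evidence incorporation methods, the Bayesian-vs-heuristic classification cannot be cleanly separated from elicitation effects.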
Circularity Check
No circularity: empirical evaluation of evidence incorporation methods
The paper introduces the 'information processing gap' as a novel empirical technique for studying LLM belief updates and reports experimental comparisons across multiple incorporation approaches. Some yield near-Bayesian behavior while others appear heuristic, with downstream performance comparisons used to infer model misspecification. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations are present in the abstract or described methodology. The central claims rest on observable differences in task performance rather than any reduction to inputs by construction. This is a standard empirical study with no detectable circular steps.
Forward citations
Cited by 1 Pith paper
- Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
  Humans and LLMs exhibit similar error patterns in common-sense reasoning, consistent with shared pattern-matching mechanisms rather than abstract world models.