
arxiv: 2605.06915 · v2 · pith:UN2O3Z3Znew · submitted 2026-05-07 · 💻 cs.LG

LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs

Pith reviewed 2026-06-30 23:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLMs · Bayesian updating · probabilistic beliefs · information processing gap · heuristics · misspecification · belief updating · inconsistencies

The pith

LLMs update beliefs through a mix of near-Bayesian rules and learned heuristics, and the heuristics often outperform exact Bayesian computation on downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method for treating LLMs as information processing rules and uses the information processing gap between different evidence incorporation approaches to measure internal inconsistencies in probabilistic belief updates. Experiments across multiple methods show that some produce updates close to Bayesian while others follow a learned heuristic. The heuristic methods frequently deliver better results on downstream tasks than exact Bayesian calculation does. This pattern indicates that LLMs maintain misspecified internal models of the world. The same gap measure also serves as a diagnostic for problems in LLM-based inference systems.

Core claim

LLMs do not update their probabilistic beliefs consistently according to Bayesian principles; different ways of incorporating evidence produce either nearly Bayesian updates or learned heuristic updates, and the heuristic updates often achieve higher downstream task performance than exact Bayesian computation, which reveals that the models' probabilistic representations are misspecified.
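
The step from "heuristics beat exact Bayes" to "the model is misspecified" rests on a standard fact: under a correctly specified model, the exact posterior minimizes expected log loss, so a heuristic can only win when the likelihoods fed into Bayes' rule are wrong. The sketch below is a toy illustration of that logic, not the paper's experiment. The likelihood values, the tempered update used as a stand-in heuristic, and all names (`TRUE_LIK`, `MODEL_LIK`, `update`, `expected_log_loss`) are assumptions made for the example.

```python
import math

HYPOTHESES = (0, 1)
EVIDENCE = (0, 1)
PRIOR = {0: 0.5, 1: 0.5}

# P(e | h), keyed as LIK[h][e]. Both tables are made-up numbers.
TRUE_LIK = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.4, 1: 0.6}}   # how the world behaves
MODEL_LIK = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # overconfident internal model


def update(prior, lik, e, temperature=1.0):
    """Posterior over hypotheses after seeing e.

    temperature=1.0 is exact Bayes under `lik`; temperature<1.0 is a
    hypothetical heuristic that down-weights the evidence.
    """
    unnorm = {h: prior[h] * lik[h][e] ** temperature for h in HYPOTHESES}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}


def expected_log_loss(lik_used, temperature=1.0):
    """Expected -log q(h | e), with (h, e) drawn from the TRUE joint."""
    loss = 0.0
    for h in HYPOTHESES:
        for e in EVIDENCE:
            p_joint = PRIOR[h] * TRUE_LIK[h][e]
            posterior = update(PRIOR, lik_used, e, temperature)
            loss -= p_joint * math.log(posterior[h])
    return loss


loss_exact_misspecified = expected_log_loss(MODEL_LIK, temperature=1.0)
loss_tempered = expected_log_loss(MODEL_LIK, temperature=0.5)
loss_oracle = expected_log_loss(TRUE_LIK, temperature=1.0)

print(f"exact Bayes, wrong model : {loss_exact_misspecified:.3f}")
print(f"tempered heuristic       : {loss_tempered:.3f}")
print(f"exact Bayes, true model  : {loss_oracle:.3f}")
```

In this toy setting the tempered heuristic scores better than exact Bayes under the overconfident model, while exact Bayes under the true likelihoods scores best of the three. That ordering is the pattern the core claim reads as evidence of misspecification.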

What carries the argument

The information processing gap, which compares outputs from multiple evidence incorporation methods to expose inconsistencies in how LLMs revise probabilistic beliefs.
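
This page does not reproduce the paper's formula for the gap. As a hedged point of reference only, the phrase "information processing rule" matches the classical formulation in Zellner's "Optimal information processing and Bayes's Theorem", a work the paper cites. There, a rule that maps a prior $\pi$ and a likelihood $f$ to an output belief $g$ is scored by output information minus input information. The symbols $g$, $\pi$, $f$, $h$, and $\Delta$ below come from that formulation, written here for a discrete hypothesis set. They are assumed, not confirmed, to correspond to the paper's notation.

```latex
\Delta[g]
  = \underbrace{\sum_{\theta} g(\theta)\,\ln g(\theta) + \ln h(y)}_{\text{output information}}
  - \underbrace{\sum_{\theta} g(\theta)\,\bigl[\ln \pi(\theta) + \ln f(y \mid \theta)\bigr]}_{\text{input information}}
  = \mathrm{KL}\!\left( g \,\Big\|\, \frac{\pi(\theta)\, f(y \mid \theta)}{h(y)} \right) \;\ge\; 0,
\qquad
h(y) = \sum_{\theta} \pi(\theta)\, f(y \mid \theta).
```

Under this reading the gap is zero exactly when $g$ is the Bayesian posterior, so a nonzero value measures how far a given evidence incorporation method departs from Bayes' rule.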

If this is right

  • Certain evidence incorporation methods yield belief updates that closely match Bayesian rules.
  • Heuristic updates that deviate from Bayes can still improve performance on tasks that require probabilistic reasoning.
  • The information processing gap provides a practical diagnostic for detecting problems in systems that rely on LLMs for inference under uncertainty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures that reward consistency with Bayesian updating might reduce the performance gap between heuristic and exact methods.
  • The same inconsistency patterns could appear in other domains that require sequential belief revision, such as medical diagnosis or legal reasoning.
  • Developers could use the gap measure to select or fine-tune models for applications where calibrated uncertainty matters.

Load-bearing premise

The tested evidence incorporation methods reflect the LLMs' actual internal updating processes and are not mainly shaped by prompt format or training data.

What would settle it

A controlled test in which every evidence incorporation method produces identical belief updates that exactly match the results of Bayesian calculation on the same inputs across several models and tasks.
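
A minimal sketch of what such a test could check, assuming each evidence incorporation method has already been run and its posterior over a small discrete hypothesis set recorded. The function names, the KL-based discrepancy, the tolerance, and the example numbers are all hypothetical choices for illustration, not the paper's protocol.

```python
import math


def bayes_posterior(prior, likelihood):
    """Exact Bayes update over a discrete hypothesis set.

    prior[h] is P(h); likelihood[h] is P(observed evidence | h).
    """
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}


def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(p[h] * math.log(p[h] / max(q[h], eps)) for h in p if p[h] > 0)


def consistency_report(prior, likelihood, method_posteriors, tol=1e-3):
    """Compare each method's elicited posterior with the exact Bayes posterior."""
    reference = bayes_posterior(prior, likelihood)
    report = {}
    for name, posterior in method_posteriors.items():
        gap = kl(reference, posterior)
        report[name] = {"kl_from_bayes": gap, "matches_bayes": gap <= tol}
    return report


# Made-up elicited posteriors for two hypothetical methods.
prior = {"A": 0.5, "B": 0.5}
likelihood = {"A": 0.8, "B": 0.2}
elicited = {
    "method_1": {"A": 0.8, "B": 0.2},  # agrees with Bayes
    "method_2": {"A": 0.6, "B": 0.4},  # under-reacts to the evidence
}

report = consistency_report(prior, likelihood, elicited)
all_consistent = all(r["matches_bayes"] for r in report.values())
print(report)
print("all methods match Bayes:", all_consistent)
```

The settling outcome described above corresponds to `all_consistent` being true for every model and task in the test. A single method that misses the tolerance, as `method_2` does here, is enough to show an inconsistency.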

Original abstract

Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the 'information processing gap' technique for studying LLMs as information processing rules to quantify internal (in)consistencies in probabilistic belief updates from evidence. Experiments test multiple evidence incorporation approaches, identifying some as nearly Bayesian and others as learned heuristics; non-Bayesian heuristics are reported to often outperform exact Bayesian computation on downstream tasks, taken as evidence of misspecification in LLMs' world models. The measure is also positioned as a diagnostic for issues in LLM-powered inferential systems.

Significance. If the experimental controls and results hold after addressing elicitation confounds, the work would be significant for understanding limitations of LLMs in uncertain reasoning and for providing practical diagnostics in high-stakes applications. The reported outperformance of heuristics over exact Bayesian updates is a noteworthy observation that, if robust, would indicate systematic misspecification rather than mere inconsistency.

major comments (2)
  1. [Abstract] Abstract: the claim that non-Bayesian heuristic updates often outperform exact Bayesian computation in downstream task performance is presented without any reported datasets, sample sizes, statistical tests, or controls. This absence directly undermines evaluation of whether the data support the central claims of internal inconsistencies and world-model misspecification.
  2. [Methods / Experimental Setup] Experimental design (methods section): the information processing gap is used to distinguish nearly-Bayesian from heuristic updates, yet no controls are described for prompt-format invariance (e.g., verbalization of priors/likelihoods, chain-of-thought instructions, or output format). Without such controls, differences attributed to internal representations may instead reflect elicitation artifacts, rendering the Bayesian-vs-heuristic classification and the downstream-performance comparison load-bearing but insecure.
minor comments (1)
  1. [Introduction / §2] Define the information processing gap with an explicit equation or pseudocode in the main text (rather than deferring to an appendix) to improve clarity of the core metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of presentation and experimental robustness. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that non-Bayesian heuristic updates often outperform exact Bayesian computation in downstream task performance is presented without any reported datasets, sample sizes, statistical tests, or controls. This absence directly undermines evaluation of whether the data support the central claims of internal inconsistencies and world-model misspecification.

    Authors: The abstract summarizes a finding whose supporting details (datasets, sample sizes, and statistical comparisons) appear in the main text. To improve immediate evaluability of the claim, we will revise the abstract to include a brief qualifier referencing the experimental basis.
    Revision: yes

  2. Referee: [Methods / Experimental Setup] Experimental design (methods section): the information processing gap is used to distinguish nearly-Bayesian from heuristic updates, yet no controls are described for prompt-format invariance (e.g., verbalization of priors/likelihoods, chain-of-thought instructions, or output format). Without such controls, differences attributed to internal representations may instead reflect elicitation artifacts, rendering the Bayesian-vs-heuristic classification and the downstream-performance comparison load-bearing but insecure.

    Authors: This concern about elicitation artifacts is well-taken. Our experiments used fixed prompt templates, but we did not report systematic invariance tests. We will expand the methods section to document prompt standardization and add robustness analyses testing sensitivity to verbalization and output format.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of evidence incorporation methods

Full rationale

The paper introduces the 'information processing gap' as a novel empirical technique for studying LLM belief updates and reports experimental comparisons across multiple incorporation approaches. Some yield near-Bayesian behavior while others appear heuristic, with downstream performance comparisons used to infer model misspecification. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations are present in the abstract or described methodology. The central claims rest on observable differences in task performance rather than any reduction to inputs by construction. This is a standard empirical study with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5714 in / 987 out tokens · 24389 ms · 2026-06-30T23:03:03.586569+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

    cs.AI 2026-06 unverdicted novelty 6.0

    Humans and LLMs exhibit similar error patterns in common-sense reasoning, consistent with shared pattern-matching mechanisms rather than abstract world models.
