pith. machine review for the scientific record.

arxiv: 2605.06915 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs

Adam Goliński, Chacha Chen, Guillermo Sapiro, Masha Fedzechkina, Matthew Jörke, Nicholas Foti, Sinead Williamson

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords: large language models · Bayesian updating · probabilistic beliefs · information processing gap · belief consistency · heuristics · model misspecification

The pith

Large language models do not consistently update probabilistic beliefs according to Bayesian rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to treat LLMs as information processing rules and introduces the information processing gap to measure how consistently they update beliefs when new evidence arrives. Experiments across multiple evidence-incorporation approaches show that some produce nearly Bayesian updates while others rely on learned heuristics. The heuristic approaches often deliver better downstream task performance than exact Bayesian computation, which points to misspecification in the models' internal representations of the world. The gap measure also functions as a diagnostic for problems in LLM-based inference systems.

Core claim

LLMs are not consistently Bayesian: when evidence is incorporated, some methods produce nearly Bayesian belief updates while others follow learned heuristics, and the non-Bayesian heuristics frequently outperform exact Bayesian computation on tasks, indicating that the models' probabilistic world models are misspecified.

What carries the argument

The information processing gap, a quantitative measure of the inconsistency between an LLM's actual belief updates from evidence and the updates required by Bayesian probability.
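The abstract does not pin down the exact formula, but the construction it describes — compare a directly prompted posterior to the Bayes posterior implied by the model's own elicited prior and likelihood — can be sketched as follows. The KL-divergence form here is an assumption (an absolute or total-variation difference would work equally well), and the function names are illustrative, not the paper's.

```python
import numpy as np

def bayes_posterior(prior, likelihood):
    """Exact Bayesian posterior from an elicited prior P(h) and
    likelihood P(e | h) for one observed piece of evidence e."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

def information_processing_gap(prompted_posterior, prior, likelihood):
    """One plausible instantiation of the gap: KL divergence between the
    model's directly prompted posterior and the Bayes posterior implied
    by its own elicited prior and likelihood."""
    p = np.asarray(prompted_posterior, dtype=float)
    q = bayes_posterior(np.asarray(prior, dtype=float),
                        np.asarray(likelihood, dtype=float))
    eps = 1e-12  # guard against log(0)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# A perfectly Bayesian updater has zero gap:
prior = np.array([0.5, 0.5])
likelihood = np.array([0.9, 0.1])
bayes = bayes_posterior(prior, likelihood)                   # → [0.9, 0.1]
print(information_processing_gap(bayes, prior, likelihood))  # → 0.0

# A heuristic updater that under-reacts to the evidence shows a positive gap:
print(information_processing_gap([0.7, 0.3], prior, likelihood))
```

A zero gap means the prompted update is internally consistent with the model's own elicited quantities, whatever metric is chosen; a large gap flags a heuristic update.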

If this is right

  • Some methods for incorporating evidence into LLMs achieve nearly Bayesian updates.
  • Other methods use learned heuristics that deviate from Bayesian standards.
  • Heuristic-based updates can produce higher downstream task performance than exact Bayesian computation.
  • The information processing gap serves as a diagnostic for identifying issues in LLM-powered inferential systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Forcing stricter Bayesian consistency might reduce rather than improve practical performance in some applications.
  • The gap could be applied to test consistency in other forms of reasoning beyond probability.
  • Training data may encourage approximate heuristics over exact inference rules.

Load-bearing premise

That LLMs maintain stable internal probabilistic beliefs which can be reliably elicited and that the information processing gap captures genuine inconsistencies rather than prompting artifacts.

What would settle it

Repeating the same evidence-incorporation experiment with varied but logically equivalent prompt phrasings or interfaces, and finding that the measured gaps and downstream performance rankings remain unchanged.
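Such a robustness check can be sketched in miniature. Everything here is a stand-in: `elicit` would call the LLM once per paraphrase in a real run (it is stubbed with jittered copies of one elicitation), and total-variation distance is one assumed gap metric among several.

```python
import numpy as np

rng = np.random.default_rng(0)

def elicit(paraphrase_id):
    """Stub for eliciting a prompted posterior under one paraphrase.
    In a real check this would query the LLM; here paraphrase sensitivity
    is simulated as small jitter around a fixed elicitation."""
    base = np.array([0.6, 0.3, 0.1])
    noisy = np.clip(base + rng.normal(0, 0.02, 3), 1e-6, None)
    return noisy / noisy.sum()

def gap(p, q):
    # total-variation distance as one simple gap metric (an assumption;
    # the paper's exact metric may differ)
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

bayes = np.array([0.75, 0.2, 0.05])  # reference Bayes posterior (stub)
gaps = [gap(elicit(i), bayes) for i in range(20)]

# The result would "settle it" if the gap is stable across paraphrases:
print(np.mean(gaps), np.std(gaps))
```

A small standard deviation across logically equivalent phrasings would support reading the gap as an internal property of the model rather than a prompting artifact.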

read the original abstract

Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the 'information processing gap' to quantify inconsistencies in how LLMs update probabilistic beliefs upon receiving evidence. It evaluates multiple evidence-incorporation methods via prompting, finding that some produce nearly Bayesian updates while others rely on learned heuristics; surprisingly, the heuristic updates often outperform exact Bayesian computation on downstream tasks, which the authors interpret as evidence that LLMs' internal world models are misspecified. The work also positions the gap as a diagnostic tool for LLM-powered inferential systems.

Significance. If the prompted elicitations reliably reflect stable internal probabilistic representations rather than surface artifacts, the results would be significant for understanding and improving LLM reasoning under uncertainty in domains like medicine and science. The empirical comparison of update rules and the counterintuitive finding that non-Bayesian heuristics can outperform exact Bayes provide a concrete, falsifiable lens on model misspecification that goes beyond standard accuracy benchmarks.

major comments (2)
  1. [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a direct prompted posterior to a Bayesian posterior computed from separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.
  2. [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.
minor comments (2)
  1. [Eq. (2)] Notation for the information processing gap (Eq. 2) could be clarified by explicitly stating whether the gap is an absolute difference, KL divergence, or another metric, and by providing the exact formula used for the Bayesian posterior computation.
  2. [Figure 3] Figure 3 (or equivalent) showing update consistency across methods would benefit from error bars or confidence intervals to allow readers to assess the reliability of the 'nearly Bayesian' classification.
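The ambiguity flagged in minor comment 1 amounts to a choice among candidate divergences between the prompted posterior and the Bayes posterior implied by the model's own elicited prior and likelihood. Two hedged candidate forms (the paper's actual Eq. 2 is not determined by the abstract alone):

```latex
% Candidate instantiations of the information processing gap (Eq. 2)
\Delta_{\mathrm{KL}}
  = \mathrm{KL}\!\left( p_{\mathrm{LLM}}(h \mid e) \,\middle\|\,
      \frac{p_{\mathrm{LLM}}(e \mid h)\, p_{\mathrm{LLM}}(h)}
           {\sum_{h'} p_{\mathrm{LLM}}(e \mid h')\, p_{\mathrm{LLM}}(h')} \right),
\qquad
\Delta_{\mathrm{TV}}
  = \tfrac{1}{2} \sum_h \left| p_{\mathrm{LLM}}(h \mid e)
      - p_{\mathrm{Bayes}}(h \mid e) \right|.
```

Either choice makes the "nearly Bayesian" classification precise; stating which one is used, and at what threshold, would resolve the comment.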

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the robustness and interpretability of our results on the information processing gap. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a direct prompted posterior to a Bayesian posterior computed from separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.

    Authors: We agree that systematic robustness checks are essential to ensure the information processing gap captures internal (in)consistencies rather than elicitation artifacts. While the original experiments used a consistent prompting protocol across models and tasks, we did not report exhaustive variations. In the revised manuscript, we will add a dedicated robustness subsection to §4 that includes: (i) multiple alternative phrasings for prior, likelihood, and posterior elicitation; (ii) variations in evidence presentation order; and (iii) results across temperatures from 0.0 to 1.0. We will quantify how the gap and downstream performance differences vary (or remain stable) under these conditions, with statistical summaries of sensitivity. revision: yes

  2. Referee: [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.

    Authors: We appreciate the request for greater transparency in the downstream evaluation. The Bayesian baseline throughout §4.3 is computed from the same elicited priors and likelihoods used for the prompted posteriors (not an oracle). To address the lack of breakdowns, the revision will expand §4.3 with per-task and per-scale tables reporting: mean performance differences, effect sizes (Cohen's d), and p-values from paired statistical tests. These will be disaggregated by task domain (e.g., medical, scientific) and model size, allowing clearer assessment of whether the heuristic advantage is consistent with misspecification. revision: yes
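The per-task statistics promised in the rebuttal (Cohen's d and a paired test) are standard and can be sketched directly. The accuracy numbers below are purely illustrative, not from the paper.

```python
import numpy as np

def cohens_d_paired(x, y):
    """Cohen's d for paired samples: mean of the differences
    divided by the standard deviation of the differences."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(d.mean() / d.std(ddof=1))

def paired_t_stat(x, y):
    """t statistic for a paired test: mean difference over its standard error."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))

# Hypothetical per-task accuracies for the two update rules:
heuristic_acc = np.array([0.71, 0.68, 0.74, 0.66, 0.70])
bayes_acc     = np.array([0.65, 0.66, 0.69, 0.64, 0.67])

print(cohens_d_paired(heuristic_acc, bayes_acc))  # effect size of the heuristic advantage
print(paired_t_stat(heuristic_acc, bayes_acc))
```

Reporting these per task and per model scale, as the rebuttal proposes, would make it possible to see whether the heuristic advantage is uniform or driven by a subset of domains.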

Circularity Check

0 steps flagged

Empirical evaluation of LLM belief updates with no self-referential derivations or fitted predictions

full rationale

The paper frames its contribution as an empirical study introducing the 'information processing gap' to compare LLM evidence incorporation methods against exact Bayesian updates. No load-bearing mathematical derivations, parameter fits renamed as predictions, or self-citation chains appear in the abstract or described methodology. All central claims rest on experimental comparisons of prompted outputs versus computed baselines, which are externally falsifiable and do not reduce to the paper's own inputs by construction. This is the expected non-circular outcome for a purely evaluative work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on treating LLMs as having elicitable probabilistic beliefs and on the validity of the new information processing gap as a diagnostic; no free parameters are described, but the approach introduces a new conceptual entity without external validation beyond the reported experiments.

axioms (1)
  • domain assumption: LLMs maintain internal probabilistic beliefs that can be accessed and compared to ideal Bayesian updates through prompting or other interfaces.
    This is the foundational premise for defining the information processing gap and evaluating consistency.
invented entities (1)
  • information processing gap (no independent evidence)
    purpose: Quantifies the difference between an LLM's actual belief update and the ideal Bayesian update from evidence.
    New measure introduced to study internal inconsistencies; no independent evidence provided outside the paper's experiments.

pith-pipeline@v0.9.0 · 5483 in / 1445 out tokens · 86054 ms · 2026-05-11T00:49:03.024641+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

17 extracted references · 13 canonical work pages · 4 internal anchors
