Express Your Doubts -- Probabilistic World Modeling Should not be Based on Token logprobs
Pith reviewed 2026-05-22 16:00 UTC · model grok-4.3
The pith
LLM token logprobs should not be used to estimate probabilities of real-world events because training and task settings create conflicting output distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the theoretical difference between estimating a full distribution over strings and predicting responses for downstream tasks produces distinct and sometimes opposing requirements on the probability values an LLM should assign. When these requirements are not aligned, treating the probabilities that the model assigns to its own tokens as estimates of external event probabilities leads to systematic errors. The paper therefore concludes that second-order prediction, in which the model states probabilities explicitly in its generated text, is the method that remains consistent across training regimes and application settings.
What carries the argument
Second-order prediction, the practice of requiring the model to include explicit probability assessments inside its textual output rather than deriving them from the token distribution.
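A minimal sketch of the two readings, under stated assumptions: the logprob dictionary, the response string, and the "Probability: N%" output format are hypothetical and not taken from the paper. It contrasts a first-order reading (exponentiating the logprob of an answer token) with a second-order reading (parsing the probability the model states in its text).

```python
import math
import re
from typing import Dict, Optional

def prob_from_logprob(token_logprobs: Dict[str, float], token: str) -> float:
    """First-order reading: exponentiate the logprob assigned to an answer token."""
    return math.exp(token_logprobs[token])

# Assumed output convention: the response contains "Probability: <number>[%]".
_PROB_PATTERN = re.compile(r"probability\s*[:=]\s*(\d+(?:\.\d+)?)\s*(%?)", re.IGNORECASE)

def prob_from_text(response: str) -> Optional[float]:
    """Second-order reading: parse the probability stated in the generated text."""
    match = _PROB_PATTERN.search(response)
    if match is None:
        return None  # the model stated no probability in the assumed format
    value = float(match.group(1))
    if match.group(2) == "%":
        value /= 100.0
    # Reject values that are not valid probabilities.
    return value if 0.0 <= value <= 1.0 else None

# Hypothetical values for illustration only.
token_logprobs = {"Yes": math.log(0.9), "No": math.log(0.1)}
response = "Yes, rain is more likely than not. Probability: 65%"

first_order = prob_from_logprob(token_logprobs, "Yes")   # confidence in emitting "Yes"
second_order = prob_from_text(response)                  # stated event probability
```

The two numbers need not agree: the token probability describes how likely the model is to emit "Yes" as a response, while the parsed value is the model's stated estimate for the event.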
If this is right
- Common applications that read event probabilities from LLM token logprobs are likely to produce inconsistent or biased results across different prompts and training phases.
- Explicitly asking models to output probability statements can remove the mismatch between what the training objective wants and what a downstream user wants.
- Evaluation of probabilistic reasoning in LLMs should measure the accuracy of the probabilities the model states in text rather than the token probabilities it assigns during generation.
- Training procedures could be modified to reward models for producing well-calibrated probability statements inside their responses.
Where Pith is reading between the lines
- Decision systems that rely on LLM uncertainty estimates may need to switch from reading logprobs to parsing explicit probability phrases in the generated text.
- The same distinction between distribution and response goals could appear in other generative models that are asked to output both content and confidence.
- Calibration benchmarks for LLMs might be redesigned to test whether stated probabilities match observed frequencies rather than checking token-level likelihoods.
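One way such a check could look, as a hedged sketch rather than a benchmark from the paper: given probabilities a model stated in text and the observed binary outcomes, compute the Brier score and a binned comparison of stated probability against observed frequency. The function names and the data are illustrative assumptions.

```python
from collections import defaultdict

def brier_score(stated, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(stated, outcomes)) / len(stated)

def reliability_table(stated, outcomes, n_bins=5):
    """Group predictions into equal-width bins and compare the mean stated
    probability with the observed event frequency in each non-empty bin."""
    bins = defaultdict(list)
    for p, y in zip(stated, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        rows.append((idx, len(pairs), mean_p, freq))
    return rows

# Hypothetical stated probabilities and outcomes (1 = event occurred).
stated = [0.9, 0.9, 0.1, 0.1, 0.5, 0.5]
outcomes = [1, 1, 0, 0, 1, 0]

score = brier_score(stated, outcomes)
table = reliability_table(stated, outcomes)
for idx, count, mean_p, freq in table:
    print(f"bin {idx}: n={count} mean stated={mean_p:.2f} observed={freq:.2f}")
```

A well-calibrated set of stated probabilities shows mean stated values close to the observed frequencies in each bin. The Brier score is a proper scoring rule, so it could also serve as a reward signal of the kind mentioned above.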
Load-bearing premise
The assumption that the goals of distribution estimation and response prediction produce genuinely different desired output distributions in the actual training and deployment settings of current language models.
What would settle it
A controlled test would undermine the claim of conflicting requirements if the same model, run once under a distribution-estimation objective and once under a response-prediction objective, produced identical probability assignments for the same set of external events.
Original abstract
Language modeling has shifted in recent years from a distribution over strings to prediction models with textual inputs and outputs for general-purpose tasks. This position paper highlights the often overlooked implications of this shift for the use of large language models (LLMs) as probability estimators, especially for world probabilities. In light of the theoretical distinction between distribution estimation and response prediction, we examine LLM training phases and common use cases for LLM output probabilities. We show that the different settings lead to distinct, potentially conflicting, desired output distributions. This lack of clarity leads to pitfalls when using output probabilities as event probabilities. Our position advocates for second-order prediction -- incorporating probabilities explicitly as part of the output -- as a theoretically sound method, in contrast to using token logprobs. We conclude with suggestions for potential directions to improve the probabilistic soundness of this method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This position paper argues that the recent shift in language modeling from pure string distribution estimation to general-purpose response prediction in LLMs creates distinct and potentially conflicting target distributions over tokens. It examines training phases and common use cases to show that treating token logprobs as event probabilities for world modeling leads to pitfalls, and advocates instead for second-order prediction in which probabilities are explicitly generated as part of the textual output.
Significance. If the conceptual distinction and resulting conflicts are substantiated, the paper offers a timely theoretical lens on the misuse of LLM output probabilities in probabilistic reasoning tasks. It could usefully steer the field toward more explicit uncertainty modeling rather than post-hoc interpretation of logprobs, with implications for applications requiring calibrated world probabilities.
major comments (1)
- [Abstract / training phases discussion] Abstract and the section examining LLM training phases and use cases: the central claim that these settings produce 'distinct, potentially conflicting, desired output distributions' is asserted without a concrete side-by-side derivation or worked example (e.g., the optimal next-token distribution under pre-training MLE versus under a calibrated probabilistic query). This gap directly affects whether the advocated pitfalls are demonstrated rather than stipulated.
minor comments (2)
- [Advocacy section] The term 'second-order prediction' is introduced without an explicit formal definition or pseudocode sketch of how probabilities would be incorporated into the output format.
- [Related work] A few citations to prior work on calibration and uncertainty in LLMs appear to be missing from the related-work discussion.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential significance of the position paper. We address the major comment below and will revise the manuscript to incorporate a concrete example as suggested.
Point-by-point responses
- Referee: [Abstract / training phases discussion] Abstract and the section examining LLM training phases and use cases: the central claim that these settings produce 'distinct, potentially conflicting, desired output distributions' is asserted without a concrete side-by-side derivation or worked example (e.g., the optimal next-token distribution under pre-training MLE versus under a calibrated probabilistic query). This gap directly affects whether the advocated pitfalls are demonstrated rather than stipulated.
  Authors: We agree that a concrete side-by-side derivation would make the central claim more demonstrative. In the revised manuscript, we will add a worked example contrasting the optimal next-token distribution under standard pre-training MLE (which aligns with the empirical distribution of tokens in the training data) with the distribution under a setting that explicitly targets calibrated world probabilities (e.g., via an auxiliary loss encouraging the model to output probabilities matching observed event frequencies in a probabilistic query). This addition will illustrate the divergence in desired outputs arising from the shift between distribution estimation and response prediction, thereby addressing the concern that the pitfalls are stipulated rather than shown.
  Revision: yes
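As a hedged illustration of the kind of side-by-side contrast under discussion (a toy construction, not the authors' promised example): for an unconstrained categorical model, the maximum-likelihood next-token distribution equals the empirical frequency of continuations in the corpus. If text reports one outcome more often than it occurs, that distribution diverges from the world probability. The corpus and the fair-coin world model below are assumptions made up for this sketch.

```python
from collections import Counter

def mle_next_token(continuations):
    """MLE for a categorical next-token distribution: empirical frequencies."""
    counts = Counter(continuations)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Hypothetical corpus: continuations observed after "The coin landed on".
# Assumption: writers happen to report "heads" more often than it occurs.
corpus_continuations = ["heads", "heads", "heads", "tails"]

# Assumed world model: a fair coin.
world_probability = {"heads": 0.5, "tails": 0.5}

mle = mle_next_token(corpus_continuations)
gap = {tok: mle[tok] - world_probability[tok] for tok in world_probability}

print("MLE next-token distribution:", mle)
print("Gap versus world probability:", gap)
```

Under these assumptions, a model that fits the corpus perfectly assigns 0.75 to "heads", so reading that token probability as the event probability would overstate it by 0.25.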
Circularity Check
Conceptual position paper on LLM probability use exhibits no circular derivation
full rationale
The paper is a position paper that relies on a pre-existing theoretical distinction between distribution estimation and response prediction to argue for pitfalls in using token logprobs as event probabilities. It qualitatively examines training phases and use cases to highlight potentially conflicting desired distributions, then advocates second-order prediction as an alternative. No mathematical derivations, equations, or fitted parameters are presented that reduce any claimed result to the paper's own inputs by construction. The central claims remain independent of self-citation chains or self-definitional loops, making the argument self-contained against external benchmarks in probabilistic modeling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The theoretical distinction between distribution estimation and response prediction leads to distinct and potentially conflicting desired output distributions in LLM training and use cases.