Express Your Doubts -- Probabilistic World Modeling Should not be Based on Token logprobs

Eitan Wagner; Omri Abend

arxiv: 2505.02072 · v2 · submitted 2025-05-04 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 16:00 UTC · model grok-4.3

keywords: language models · probability estimation · token logprobs · second-order prediction · world modeling · distribution estimation · response prediction · uncertainty in LLMs

The pith

LLM token logprobs should not be used to estimate probabilities of real-world events because training and task settings create conflicting output distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how language models have moved from estimating distributions over strings to serving as general predictors that take text input and produce text output. It argues that this change creates a mismatch when people try to read the model's token probabilities as direct measures of how likely events are in the world. Different stages of training and different common uses pull the model toward different probability assignments for the same events, so the numbers that appear in the output cannot be trusted as event probabilities. The authors therefore recommend that models instead produce explicit probability statements as part of their normal text responses. This approach, which they term second-order prediction, sidesteps the conflicts that arise from reading log probabilities off the token distribution.

Core claim

The central claim is that the theoretical difference between estimating a full distribution over strings and predicting responses for downstream tasks produces distinct and sometimes opposing requirements on the probability values an LLM should assign. When these requirements are not aligned, treating the probabilities that the model assigns to its own tokens as estimates of external event probabilities leads to systematic errors. The paper therefore concludes that second-order prediction, in which the model states probabilities explicitly in its generated text, is the method that remains consistent across training regimes and application settings.

What carries the argument

Second-order prediction, the practice of requiring the model to include explicit probability assessments inside its textual output rather than deriving them from the token distribution.
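
As a minimal sketch of what consuming a second-order prediction could look like, the snippet below reads an explicitly stated probability out of generated text instead of touching token logprobs. The output convention ("Probability: 0.7" or "Probability: 70%") and the function name `parse_stated_probability` are assumptions made for illustration; the paper does not prescribe a format.

```python
import re
from typing import Optional

# Hypothetical convention: the model is prompted to state its estimate
# as "Probability: 0.7" or "Probability: 70%" somewhere in its answer.
_PATTERN = re.compile(
    r"probability\s*[:=]\s*([0-9]*\.?[0-9]+)\s*(%?)", re.IGNORECASE
)


def parse_stated_probability(response: str) -> Optional[float]:
    """Return the last probability stated in the response, or None."""
    matches = _PATTERN.findall(response)
    if not matches:
        return None
    value_text, percent_sign = matches[-1]
    value = float(value_text)
    if percent_sign:
        value /= 100.0
    # Reject out-of-range values instead of silently clipping them.
    return value if 0.0 <= value <= 1.0 else None


print(parse_stated_probability("Rain looks likely. Probability: 70%"))
```

Returning `None` for missing or out-of-range statements keeps a malformed answer from being mistaken for a valid estimate.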

If this is right

Common applications that read event probabilities from LLM token logprobs are likely to produce inconsistent or biased results across different prompts and training phases.
Explicitly asking models to output probability statements can remove the mismatch between what the training objective wants and what a downstream user wants.
Evaluation of probabilistic reasoning in LLMs should measure the accuracy of the probabilities the model states in text rather than the token probabilities it assigns during generation.
Training procedures could be modified to reward models for producing well-calibrated probability statements inside their responses.
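
One way to make "accuracy of the probabilities the model states" concrete is a proper scoring rule. The sketch below uses the Brier score, and its negation as a possible reward signal. This is an illustrative choice, not a procedure taken from the paper, and the function names are hypothetical.

```python
from typing import Sequence


def brier_score(stated: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    if len(stated) != len(outcomes) or not stated:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(stated, outcomes)) / len(stated)


def brier_reward(stated: Sequence[float], outcomes: Sequence[int]) -> float:
    """Negated Brier score: higher is better, 0.0 is the best possible value."""
    return -brier_score(stated, outcomes)


# Stated probabilities parsed from model responses (made-up values),
# paired with whether each event actually occurred.
print(brier_score([0.9, 0.2, 0.5], [1, 0, 1]))
```

Because the Brier score is a proper scoring rule, its expected value is minimized by stating the true event probability, which is the property a calibration-oriented reward would need.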

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Decision systems that rely on LLM uncertainty estimates may need to switch from reading logprobs to parsing explicit probability phrases in the generated text.
The same distinction between distribution and response goals could appear in other generative models that are asked to output both content and confidence.
Calibration benchmarks for LLMs might be redesigned to test whether stated probabilities match observed frequencies rather than checking token-level likelihoods.
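
A benchmark that tests whether stated probabilities match observed frequencies could be built on a binned calibration error. The following is a rough sketch under the assumption of equal-width bins; the bin count and the function name are illustrative and not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    stated: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean stated probability and observed frequency."""
    if len(stated) != len(outcomes) or not stated:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(stated, outcomes):
        # A stated probability of exactly 1.0 goes into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    total = len(stated)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_stated = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        error += (len(members) / total) * abs(mean_stated - observed)
    return error


# Five events, each given a stated probability of 0.8; four of them occurred.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

A value near zero means that events assigned a stated probability of about p occurred about a fraction p of the time within each bin.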

Load-bearing premise

The assumption that the goals of distribution estimation and response prediction produce genuinely different desired output distributions in the actual training and deployment settings of current language models.

What would settle it

A controlled test in which the same model, when run once under a distribution-estimation objective and once under a response-prediction objective, produces identical probability assignments for the same set of external events would undermine the claim of conflicting requirements.

Original abstract

Language modeling has shifted in recent years from a distribution over strings to prediction models with textual inputs and outputs for general-purpose tasks. This position paper highlights the often overlooked implications of this shift for the use of large language models (LLMs) as probability estimators, especially for world probabilities. In light of the theoretical distinction between distribution estimation and response prediction, we examine LLM training phases and common use cases for LLM output probabilities. We show that the different settings lead to distinct, potentially conflicting, desired output distributions. This lack of clarity leads to pitfalls when using output probabilities as event probabilities. Our position advocates for second-order prediction -- incorporating probabilities explicitly as part of the output -- as a theoretically sound method, in contrast to using token logprobs. We conclude with suggestions for potential directions to improve the probabilistic soundness of this method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper flags a real mismatch when LLMs shift from distribution estimation to response prediction, but the conflicting-distributions claim stays conceptual without examples.

The main point is that LLMs are no longer just estimating string distributions; they are trained to produce responses, so their token probabilities do not automatically give the right numbers for world events. That distinction is worth stating plainly, and the paper connects it to common uses like planning and decision systems. It also recommends outputting probabilities explicitly instead of relying on logprobs, which is a straightforward way to avoid the ambiguity.

Referee Report

1 major / 2 minor

Summary. This position paper argues that the recent shift in language modeling from pure string distribution estimation to general-purpose response prediction in LLMs creates distinct and potentially conflicting target distributions over tokens. It examines training phases and common use cases to show that treating token logprobs as event probabilities for world modeling leads to pitfalls, and advocates instead for second-order prediction in which probabilities are explicitly generated as part of the textual output.

Significance. If the conceptual distinction and resulting conflicts are substantiated, the paper offers a timely theoretical lens on the misuse of LLM output probabilities in probabilistic reasoning tasks. It could usefully steer the field toward more explicit uncertainty modeling rather than post-hoc interpretation of logprobs, with implications for applications requiring calibrated world probabilities.

major comments (1)

[Abstract / training phases discussion] Abstract and the section examining LLM training phases and use cases: the central claim that these settings produce 'distinct, potentially conflicting, desired output distributions' is asserted without a concrete side-by-side derivation or worked example (e.g., the optimal next-token distribution under pre-training MLE versus under a calibrated probabilistic query). This gap directly affects whether the advocated pitfalls are demonstrated rather than stipulated.

minor comments (2)

[Advocacy section] The term 'second-order prediction' is introduced without an explicit formal definition or pseudocode sketch of how probabilities would be incorporated into the output format.
[Related work] A few citations to prior work on calibration and uncertainty in LLMs appear to be missing from the related-work discussion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of the position paper. We address the major comment below and will revise the manuscript to incorporate a concrete example as suggested.

Referee: [Abstract / training phases discussion] Abstract and the section examining LLM training phases and use cases: the central claim that these settings produce 'distinct, potentially conflicting, desired output distributions' is asserted without a concrete side-by-side derivation or worked example (e.g., the optimal next-token distribution under pre-training MLE versus under a calibrated probabilistic query). This gap directly affects whether the advocated pitfalls are demonstrated rather than stipulated.

Authors: We agree that a concrete side-by-side derivation would make the central claim more demonstrative. In the revised manuscript, we will add a worked example contrasting the optimal next-token distribution under standard pre-training MLE (which aligns with the empirical distribution of tokens in the training data) with the distribution under a setting that explicitly targets calibrated world probabilities (e.g., via an auxiliary loss encouraging the model to output probabilities matching observed event frequencies in a probabilistic query). This addition will illustrate the divergence in desired outputs arising from the shift between distribution estimation and response prediction, thereby addressing the concern that the pitfalls are stipulated rather than shown.

Revision: yes
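
The worked example the rebuttal promises is not reproduced on this page. As a purely illustrative stand-in, the toy sketch below shows the kind of divergence at issue: the maximum-likelihood next-token distribution follows how often continuations appear in text, which need not equal the probability of the underlying event. The corpus, its 7:3 reporting bias, and the function name are all invented for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical corpus about a fair coin. Writers in this toy world report
# "heads" more often than "tails", so text frequency differs from event frequency.
corpus = ["the coin landed heads"] * 7 + ["the coin landed tails"] * 3


def mle_next_token(texts: List[str], prefix: List[str]) -> Dict[str, float]:
    """Empirical next-token distribution after `prefix` (the MLE solution)."""
    counts = Counter()
    n = len(prefix)
    for text in texts:
        tokens = text.split()
        if tokens[:n] == prefix and len(tokens) > n:
            counts[tokens[n]] += 1
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


token_distribution = mle_next_token(corpus, ["the", "coin", "landed"])
world_probability_heads = 0.5  # the coin is fair by assumption

# Gap between what pre-training MLE rewards and the event probability.
gap = token_distribution["heads"] - world_probability_heads
print(token_distribution, gap)
```

Under these assumptions, a model that fits the corpus perfectly assigns "heads" a token probability of 0.7 while the event probability is 0.5, so reading the token probability as a world probability would overstate it by about 0.2.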

Circularity Check

0 steps flagged

Conceptual position paper on LLM probability use exhibits no circular derivation

The paper is a position paper that relies on a pre-existing theoretical distinction between distribution estimation and response prediction to argue for pitfalls in using token logprobs as event probabilities. It qualitatively examines training phases and use cases to highlight potentially conflicting desired distributions, then advocates second-order prediction as an alternative. No mathematical derivations, equations, or fitted parameters are presented that reduce any claimed result to the paper's own inputs by construction. The central claims remain independent of self-citation chains or self-definitional loops, making the argument self-contained against external benchmarks in probabilistic modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the conceptual distinction between distribution estimation and response prediction without introducing new mathematical parameters or entities.

axioms (1)

Domain assumption: The theoretical distinction between distribution estimation and response prediction leads to distinct and potentially conflicting desired output distributions in LLM training and use cases.
This premise is invoked to explain why token logprobs are unsuitable for world probability estimation.
