When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

Francesco Corielli

arXiv: 2605.23278 · v2 · pith:ZDD6TL7Enew · submitted 2026-05-22 · cs.CL · stat.ML

Pith reviewed 2026-05-25 05:00 UTC · model grok-4.3

keywords: next-token prediction, marginalization, ergodicity, conditional sufficiency, RAG, tool use, mixture identifiability, language models

The pith

Next-token prediction estimates the marginal text-only law and is useful only when observed prefixes are approximately sufficient statistics for latent circumstances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes the full conditional language process (conditioned on latent facts, intentions, and context), the marginal text-only process obtained by integrating those circumstances out, and the distribution learned from finite observed sequences. Interpreting training as estimating the marginal requires assumptions of stationarity, representativeness, and ergodicity that are standard in statistics but difficult to justify for heterogeneous language data. Usefulness of the resulting model for next-token prediction further requires that the residual conditional mutual information between the next token and the omitted circumstances, given the text prefix, be small. The argument extends to heterogeneous corpora and treats RAG and tool use as mechanisms that increase conditional sufficiency.

Core claim

A model trained on realized token trajectories receives sampled continuations and therefore estimates the marginal text-only process rather than the full conditional law; this marginal is useful for prediction only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation, which holds when residual conditional mutual information is small.
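
The usefulness criterion can be made concrete on a toy distribution. The sketch below is an illustration only, not the paper's construction: the joint `p(x, z, y)`, the parameter `q` (how often the prefix reveals the latent circumstance) and `noise` are all hypothetical choices. It computes the residual term I(Y; Z | X) as H(Y | X) - H(Y | X, Z).

```python
from collections import defaultdict
from math import log2


def build_joint(q, noise=0.1):
    """Hypothetical joint p(x, z, y).

    z: latent circumstance (fair coin); x: observed prefix, equal to z
    with probability q; y: next token, equal to z with probability 1 - noise.
    """
    joint = {}
    for z in (0, 1):
        for x in (0, 1):
            px = q if x == z else 1 - q
            for y in (0, 1):
                py = 1 - noise if y == z else noise
                joint[(x, z, y)] = 0.5 * px * py
    return joint


def entropy(joint, idx):
    """Shannon entropy in bits of the marginal over the coordinates in idx."""
    marg = defaultdict(float)
    for key, p in joint.items():
        marg[tuple(key[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


def cond_entropy(joint, target, given):
    """H(target | given) = H(given, target) - H(given)."""
    return entropy(joint, given + target) - entropy(joint, given)


def residual_cmi(joint):
    """I(Y; Z | X) with coordinates (x, z, y) = (0, 1, 2)."""
    return cond_entropy(joint, (2,), (0,)) - cond_entropy(joint, (2,), (0, 1))


if __name__ == "__main__":
    for q in (0.5, 0.6, 0.9, 1.0):
        print(f"q={q}: I(Y;Z|X) = {residual_cmi(build_joint(q)):.3f} bits")
```

In this toy setting the residual term vanishes when the prefix fully reveals the circumstance (`q = 1.0`) and is largest when the prefix is uninformative (`q = 0.5`).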

What carries the argument

The three-way distinction among the full conditional language process, the marginal text-only process, and the model-induced distribution, with local sufficiency of the observed prefix serving as the condition for usefulness.

If this is right

RAG improves next-token prediction by supplying additional context that reduces residual mutual information with omitted circumstances.
Tool use functions as a conditional sufficiency device that augments the observed text with external information.
In heterogeneous training corpora the identifiability of mixture components depends on the same sufficiency conditions.
Programming tasks require richer context because code continuations depend on non-textual goals and constraints.
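
The reading of RAG as a sufficiency device can be sketched on a toy joint distribution. Everything here is an assumption made for illustration: the prefix `x` is taken to be uninformative about the latent circumstance `z`, the retrieved passage `r` equals `z` with a hypothetical accuracy `hit`, and `noise` controls how tightly the next token `y` follows `z`. The sketch compares I(Y; Z | X) with I(Y; Z | X, R).

```python
from collections import defaultdict
from math import log2


def entropy(joint, idx):
    """Shannon entropy in bits of the marginal over the coordinates in idx."""
    marg = defaultdict(float)
    for key, p in joint.items():
        marg[tuple(key[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)


def build_joint(hit, noise=0.1):
    """Hypothetical joint p(x, r, z, y) with an uninformative prefix x."""
    joint = {}
    for z in (0, 1):
        for x in (0, 1):  # prefix: fair coin, independent of z
            for r in (0, 1):  # retrieved passage: equals z w.p. hit
                pr = hit if r == z else 1 - hit
                for y in (0, 1):
                    py = 1 - noise if y == z else noise
                    joint[(x, r, z, y)] = 0.5 * 0.5 * pr * py
    return joint


def cmi(joint, y, z, given):
    """I(Y; Z | G) = H(Y | G) - H(Y | G, Z), all arguments index tuples."""
    h_y_given_g = entropy(joint, given + y) - entropy(joint, given)
    h_y_given_gz = entropy(joint, given + z + y) - entropy(joint, given + z)
    return h_y_given_g - h_y_given_gz


if __name__ == "__main__":
    joint = build_joint(hit=0.95)
    X, R, Z, Y = (0,), (1,), (2,), (3,)
    print("prefix only:        ", round(cmi(joint, Y, Z, X), 3))
    print("prefix + retrieval: ", round(cmi(joint, Y, Z, X + R), 3))
```

Under these assumptions an accurate retriever lowers the residual term, while a retriever with `hit = 0.5` leaves it unchanged.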

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the sufficiency condition fails, scaling data volume alone will not close the gap between marginal and conditional performance.
Tasks with rapidly changing external circumstances may require explicit conditioning mechanisms beyond pure next-token training.
The same marginal-versus-conditional distinction applies to any sequential prediction setting where observations are generated under varying latent regimes.
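
The first extension above (more data does not close the gap) can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions, none of which come from the paper: a single uninformative prefix, a fair-coin circumstance `z` that is discarded at training time, and a next token `y` that copies `z` except with probability `noise`. The count-based predictor converges to the marginal, so its expected log loss approaches H(Y) rather than H(Y | Z).

```python
import random
from math import log2


def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def marginal_fit_loss(n, noise=0.1, seed=0):
    """Expected log loss (bits) of a text-only predictor trained on n samples.

    The latent z is sampled but never shown to the learner, which only
    counts how often y == 1.
    """
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        z = rng.random() < 0.5
        y = z if rng.random() >= noise else (not z)
        ones += y
    p_hat = (ones + 1) / (n + 2)  # Laplace-smoothed estimate of p(y = 1)
    # Cross-entropy against the true marginal, where p(y = 1) = 0.5.
    return -(0.5 * log2(p_hat) + 0.5 * log2(1 - p_hat))


if __name__ == "__main__":
    oracle = h2(0.1)  # loss of a predictor that also observes z
    for n in (100, 10_000, 100_000):
        loss = marginal_fit_loss(n)
        print(f"n={n}: loss={loss:.4f} bits, gap to oracle={loss - oracle:.4f}")
```

In this toy case the gap to the oracle settles near I(Y; Z) = 1 - h2(noise) however large `n` becomes.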

Load-bearing premise

Real language corpora can be meaningfully analyzed as samples from a stationary ergodic process whose marginal can be estimated from finite observed trajectories.
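
A minimal sketch of why ergodicity matters for this premise, using a hypothetical two-regime process that is not taken from the paper: each trajectory draws a latent regime once and then emits tokens i.i.d. from that regime. The time average along one long trajectory then recovers the regime's own mean, not the ensemble mean over trajectories.

```python
import random

REGIME_MEANS = (0.2, 0.8)  # hypothetical P(token = 1) under each regime


def trajectory(length, rng):
    """Draw one regime for the whole trajectory, then sample tokens from it."""
    z = rng.randrange(2)
    p = REGIME_MEANS[z]
    return z, [1 if rng.random() < p else 0 for _ in range(length)]


def compare_averages(seed=0, long_len=20_000, n_short=5_000, short_len=20):
    """Return (regime mean, time average of one long run, ensemble average)."""
    rng = random.Random(seed)
    z, long_run = trajectory(long_len, rng)
    time_avg = sum(long_run) / len(long_run)
    total = 0
    for _ in range(n_short):
        _, tokens = trajectory(short_len, rng)
        total += sum(tokens)
    ensemble_avg = total / (n_short * short_len)
    return REGIME_MEANS[z], time_avg, ensemble_avg


if __name__ == "__main__":
    regime_mean, time_avg, ensemble_avg = compare_averages()
    print("regime mean:     ", regime_mean)
    print("time average:    ", round(time_avg, 3))
    print("ensemble average:", round(ensemble_avg, 3))
```

Here only the pooled average over many trajectories approaches the marginal value of 0.5; a single long trajectory stays near its regime's mean.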

What would settle it

A direct measurement showing that next-token prediction error remains high even after conditioning on prefixes that are information-theoretically sufficient for the relevant latent circumstances would falsify the usefulness criterion.

Figures

Figures reproduced from arXiv: 2605.23278 by Francesco Corielli.

Figure 2: Programming as a favorable regime: specifications, previous code, tests, and errors

Original abstract

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper cleanly separates the full conditional process, the marginal text law, and the learned model, then ties usefulness to low residual mutual information and frames RAG/tools as sufficiency fixes; the argument is standard stats applied to LMs but lacks any derivation or test.

The core point is that training on token sequences estimates a marginal over text rather than the true conditional law given latent circumstances, and this marginal is only useful when the observed prefix carries most of the relevant information about those circumstances. The paper spells out the stationarity, ergodicity, and representativeness assumptions needed for that marginal to be well-defined from finite data, notes that language corpora violate them, and says the residual conditional mutual information must be small for next-token prediction to work in practice. It then treats RAG and tool use as ways to shrink that residual term by supplying the missing context. That framing is clear and follows directly from the chain rule and sufficiency definitions.

What is actually new is the explicit mapping of augmentation methods onto conditional sufficiency; the rest recycles standard information-theoretic distinctions. The paper does this without circularity or invented quantities.

The main limitation is that everything stays at the level of definitions and verbal argument. There are no derivations showing when the residual term is small, no bounds, and no empirical checks on real corpora or models. The ergodicity critique is familiar and the paper does not add much depth to it. Heterogeneous corpora are mentioned but not analyzed in detail.

This is useful for readers who already think about LM training in information-theoretic terms and want a compact way to organize why pure next-token models need external help. It is not a result that changes practice or theory on its own. A serious editor should send it to review so the distinctions can be stress-tested by people who work on conditional generation and retrieval methods.

Referee Report

1 major / 1 minor

Summary. The paper distinguishes the full conditional language process (conditioned on latent circumstances), the marginal text-only process (circumstances integrated out), and the model distribution learned from finite corpora. It argues that next-token prediction training estimates the marginal only under stationarity, representativeness, and ergodicity assumptions (problematic for heterogeneous language data) and is useful only when the observed prefix is approximately sufficient for the relevant latent circumstances, i.e., when residual conditional mutual information I(next token; circumstances | text) is small. The argument is extended to heterogeneous corpora, and RAG/tool use is interpreted as providing conditional sufficiency.

Significance. If the framework holds, it supplies a clean information-theoretic lens for understanding the scope and limits of next-token training, the mismatch between language data and standard statistical assumptions, and the mechanistic role of retrieval and tools. This could usefully inform both theoretical analyses of LM capabilities and practical system design.

major comments (1)

[Abstract] The usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible; without such support the central practical implication remains untested.

minor comments (1)

The extension to heterogeneous training corpora is announced but receives no detailed treatment or examples in the provided text; a short dedicated subsection would clarify how the stationarity/ergodicity issues compound across domains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We address the single major comment below.

Referee: [Abstract] The usefulness claim rests on the residual conditional mutual information being small, yet the manuscript supplies neither a formal derivation of this condition from the chain rule nor any concrete bounds or corpus examples showing when the term is plausibly negligible; without such support the central practical implication remains untested.

Authors: We agree that the abstract and surrounding discussion would be strengthened by an explicit derivation and supporting illustrations. The key condition follows from the chain rule: H(next token | text) = H(next token | text, circumstances) + I(next token; circumstances | text). When the residual mutual information term is small, the marginal next-token law given text approximates the full conditional law. We will insert this derivation into the revised abstract and add a short subsection with illustrative cases (e.g., technical prose versus open-ended dialogue) showing domains where the term is plausibly negligible. These changes will be incorporated in the next version.

Revision: yes
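
The chain-rule identity quoted in the response can be written out as a short worked derivation. The symbols are borrowed from the passage cited further down the page ($X_{t+1}$ for the next token, $X_{\le t}$ for the text prefix, $Z_t$ for the omitted circumstances); the final line is the standard identity expressing conditional mutual information as an expected KL divergence, added here as a reading aid rather than as a statement from the paper.

```latex
\begin{align*}
H(X_{t+1} \mid X_{\le t})
  &= H(X_{t+1} \mid X_{\le t}, Z_t) + I(X_{t+1}; Z_t \mid X_{\le t}), \\
I(X_{t+1}; Z_t \mid X_{\le t})
  &= \mathbb{E}\!\left[ -\log p(X_{t+1} \mid X_{\le t}) \right]
   - \mathbb{E}\!\left[ -\log p(X_{t+1} \mid X_{\le t}, Z_t) \right] \\
  &= \mathbb{E}_{X_{\le t},\, Z_t}\!\left[
       D_{\mathrm{KL}}\!\left(
         p(\cdot \mid X_{\le t}, Z_t) \,\middle\|\, p(\cdot \mid X_{\le t})
       \right)
     \right] \;\ge\; 0 .
\end{align*}
```

Read this way, the residual term is the expected extra log loss paid by a predictor that uses the marginal text-only law in place of the full conditional law, and it is zero exactly when the two laws agree almost surely.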

Circularity Check

0 steps flagged

No significant circularity

The paper distinguishes the full conditional process, marginal text-only law, and learned model via standard information-theoretic definitions (chain rule, conditional mutual information, sufficiency). It states assumptions of stationarity/ergodicity/representativeness explicitly as requirements for interpreting training as marginal estimation, without deriving any quantity from fitted parameters or self-citations. RAG/tool-use are positioned as mechanisms to reduce residual I(next token; circumstances | text), following directly from the definitions without reduction to inputs. No equations or claims reduce by construction to the paper's own outputs; the argument is self-contained against external statistical concepts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions from statistics and information theory applied to language data; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Language generation is conditioned on latent non-textual circumstances (facts, events, intentions, goals, beliefs, social context).
Invoked in the abstract as the basis for distinguishing the full conditional process from the marginal text-only process.

domain assumption: Training corpora can be treated under assumptions of stationarity, representativeness, and ergodicity for marginal estimation.
Explicitly discussed in the abstract as standard statistical assumptions that are required but problematic for heterogeneous language data.

pith-pipeline@v0.9.0 · 5785 in / 1393 out tokens · 28402 ms · 2026-05-25T05:00:38.263086+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity... residual conditional mutual information $I(X_{t+1}; Z_t \mid X_{\le t}) \approx 0$"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "mixture conditional $p_{\mathrm{mix}}(x_{t+1} \mid x_{\le t}) = \sum_k p(k \mid x_{\le t})\, p_k(x_{t+1} \mid x_{\le t})$"
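
The mixture conditional in the passage just quoted can be sketched numerically. The components here are hypothetical i.i.d. Bernoulli token models, chosen only so that each component's next-token probability does not depend on the prefix; the paper's components are not specified on this page. The posterior weight p(k | prefix) is obtained by Bayes' rule from assumed prior weights.

```python
def seq_prob(tokens, p_one):
    """Probability of a binary token sequence under an i.i.d. Bernoulli model."""
    prob = 1.0
    for t in tokens:
        prob *= p_one if t == 1 else 1 - p_one
    return prob


def posterior(prefix, priors, p_ones):
    """p(k | prefix), proportional to prior_k * p_k(prefix)."""
    weights = [pi * seq_prob(prefix, p) for pi, p in zip(priors, p_ones)]
    total = sum(weights)
    return [w / total for w in weights]


def mixture_next(prefix, priors, p_ones):
    """P(next token = 1 | prefix) = sum_k p(k | prefix) * p_k(1 | prefix)."""
    post = posterior(prefix, priors, p_ones)
    # For i.i.d. components, p_k(1 | prefix) is simply p_ones[k].
    return sum(w * p for w, p in zip(post, p_ones))


if __name__ == "__main__":
    priors = [0.5, 0.5]   # hypothetical corpus shares of the two components
    p_ones = [0.2, 0.8]   # hypothetical per-component P(token = 1)
    for prefix in ([], [1], [1, 1, 1]):
        print(prefix, posterior(prefix, priors, p_ones),
              mixture_next(prefix, priors, p_ones))
```

In this sketch the prefix matters only through the posterior over components: a longer, more distinctive prefix concentrates the weight on one component, and the mixture prediction approaches that component's own prediction.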

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.