Drift and selection in LLM text ecosystems
Recognition: 2 theorem links · Pith reviewed 2026-05-15 11:46 UTC · model grok-4.3
The pith
Normative selection sustains deeper structure in recursive LLM text generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the variable-order n-gram model of recursive text generation, unfiltered drift drives the corpus toward exactly characterized stable distributions, marked by the progressive loss of rare forms and higher-order dependencies. When publication applies normative filters that reward quality, correctness, or novelty, deeper dependencies persist, and the divergence from the shallow equilibrium is bounded above by an optimal constant that the authors derive.
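Stated information-theoretically (a hedged paraphrase of "no predictive gain", not the paper's formal definition), a shallow equilibrium would be a state where, beyond some base order m, longer contexts stop reducing uncertainty:

```latex
H\!\left(X_t \mid X_{t-k}, \ldots, X_{t-1}\right)
  = H\!\left(X_t \mid X_{t-m}, \ldots, X_{t-1}\right)
  \quad \text{for all } k \ge m .
```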
What carries the argument
Variable-order n-gram agents, which generate text by conditioning predictions on contexts of varying lengths and allow exact computation of stable distributions under drift and selection.
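As an illustration of what such an agent could look like, here is a minimal sketch of a variable-order n-gram generator with longest-match backoff. The class name, the `max_order` parameter, and the backoff rule are our illustrative assumptions, not the paper's construction.

```python
import random
from collections import defaultdict

class VariableOrderNgramAgent:
    """Longest-match backoff generator: predict from the longest context
    seen in training, falling back to shorter contexts when needed."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        # counts[k]: length-k context tuple -> {next symbol: count}
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order)]

    def train(self, corpus):
        for seq in corpus:
            for i, sym in enumerate(seq):
                # Record sym under every context length k with k <= i.
                for k in range(min(self.max_order, i + 1)):
                    ctx = tuple(seq[i - k:i])
                    self.counts[k][ctx][sym] += 1

    def next_symbol(self, history):
        # Back off from the longest matching context to the unigram level.
        for k in range(self.max_order - 1, -1, -1):
            ctx = tuple(history[-k:]) if k > 0 else ()
            dist = self.counts[k].get(ctx)
            if dist:
                symbols, weights = zip(*dist.items())
                return random.choices(symbols, weights=weights)[0]
        raise ValueError("agent has not been trained")
```

Training successive agents on each other's output reproduces the drift setting; the paper's exact analysis concerns the stable distributions of that iteration.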
If this is right
- Unfiltered reuse of generated text drives the public corpus to shallow equilibria where additional context provides no predictive gain.
- Normative filters that reward quality or novelty prevent convergence to those shallow states.
- The divergence from shallow equilibria under normative selection is bounded by an optimal constant.
- The framework directly informs how training corpora for future models should be filtered to preserve or compress linguistic structure.
Where Pith is reading between the lines
- Repeated self-training loops on uncurated LLM outputs are likely to reduce linguistic diversity over time.
- Corpus builders could apply explicit quality or novelty thresholds to counteract drift and retain richer training material.
- The same drift-selection distinction may apply to recursive generation in code, images, or other modalities.
- Direct tests could track changes in n-gram entropy across generations of real LLM output (a minimal sketch follows this list).
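A minimal sketch of such an entropy tracker, assuming whitespace-tokenized text; the function names and the plug-in conditional-entropy estimate are illustrative, not the paper's measurement protocol.

```python
import math
from collections import Counter

def ngram_entropy(tokens, n):
    """Empirical Shannon entropy (bits) of the order-n n-gram distribution."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

def conditional_entropy(tokens, n):
    """H(X_n | X_1 ... X_{n-1}), estimated as H(n-gram) - H((n-1)-gram).
    Extra context pays off while this keeps falling as n grows; at a
    shallow equilibrium it should plateau across generations."""
    return ngram_entropy(tokens, n) - ngram_entropy(tokens, n - 1)
```

Applying `conditional_entropy` at several orders to samples from each generation of output gives the trajectory the bullet describes.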
Load-bearing premise
Variable-order n-gram agents provide a sufficient model of LLM text generation to capture the essential dynamics of drift and selection in the recursive public corpus.
What would settle it
An empirical measurement of higher-order n-gram frequencies in successively generated LLM corpora showing either convergence to the exact shallow equilibrium under unfiltered reuse or maintenance of statistics above the derived bound under normative selection.
Original abstract
The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality, correctness or novelty -- deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops an exactly solvable mathematical framework using variable-order n-gram agents to analyze the recursive dynamics of the public text corpus influenced by LLM-generated outputs. It distinguishes drift, which leads to the loss of rare forms with exact characterizations of stable distributions in the infinite-corpus limit, from selection effects driven by publication, ranking, and verification. The key result is that normative selection—rewarding quality, correctness, or novelty—maintains deeper structure, establishing an optimal upper bound on divergence from shallow equilibria, while non-normative publication leads to compression.
Significance. Should the derivations prove robust, this framework provides a rigorous, parameter-free tool for understanding how self-referential text generation affects linguistic diversity and complexity. The exact solvability and the clean separation of drift from selection yield direct implications for designing AI training corpora that preserve richness; the exact, potentially machine-checkable characterization of the stable distributions is a particular strength.
Major comments (1)
- [Model and derivation sections (e.g., §2–§4)] The assumption that variable-order n-gram agents sufficiently model LLM text generation to capture essential drift and selection dynamics is load-bearing for the central claim. The exact stable distributions are derived only for the n-gram case, but no demonstration is provided that the optimal upper bound on divergence survives when the generator is replaced by a context-aware model like a transformer, which can maintain rare forms and semantic novelty through long-range attention.
Minor comments (2)
- [Abstract] The abstract asserts 'exact solvability' and 'optimal upper bound' without referencing specific equations or sections, which could be clarified for readers.
- [Throughout] Ensure all notation for n-gram orders and selection functions is defined before use to improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for recognizing the value of the exact solvability and the separation of drift from selection. We respond to the single major comment below.
Point-by-point responses
Referee: [Model and derivation sections (e.g., §2–§4)] The assumption that variable-order n-gram agents sufficiently model LLM text generation to capture essential drift and selection dynamics is load-bearing for the central claim. The exact stable distributions are derived only for the n-gram case, but no demonstration is provided that the optimal upper bound on divergence survives when the generator is replaced by a context-aware model like a transformer, which can maintain rare forms and semantic novelty through long-range attention.
Authors: The manuscript develops an exactly solvable framework using variable-order n-gram agents precisely because this class permits closed-form characterization of the stable distributions under pure drift (see §3 and the infinite-corpus limit derivations). The central claim is therefore scoped to this model: drift produces a unique shallow equilibrium, while normative selection imposes an optimal upper bound on divergence from that equilibrium. We do not assert that the quantitative bound is identical for transformers. However, the qualitative mechanism (preferential reinforcement of high-probability local patterns leading to loss of rare forms) remains operative in any statistical generator trained on its own outputs, even when long-range attention is present. Transformers can reduce but not eliminate this drift without external normative filters. We will add a short limitations paragraph in §5 that (i) reiterates the modeling choice, (ii) notes that attention-based models may alter the rate of drift, and (iii) states that the qualitative distinction between non-normative compression and normative preservation of structure is expected to generalize. This constitutes a partial revision.
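To make that mechanism concrete, here is a toy simulation under heavy assumptions: a bigram generator stands in for the paper's variable-order agents, and a crude bigram-novelty score stands in for its normative selection functions. It illustrates the qualitative contrast only, not the paper's bound.

```python
import random
from collections import Counter

def resample(corpus, n_seqs, length):
    """One generation of unfiltered drift: fit a bigram model to the
    current record, then regenerate the record from it."""
    bigrams = Counter(bg for seq in corpus for bg in zip(seq, seq[1:]))
    succ = {}
    for (a, b), c in bigrams.items():
        succ.setdefault(a, []).extend([b] * c)
    symbols = [s for seq in corpus for s in seq]
    out = []
    for _ in range(n_seqs):
        sym = random.choice(symbols)
        seq = [sym]
        for _ in range(length - 1):
            sym = random.choice(succ.get(sym, symbols))
            seq.append(sym)
        out.append(seq)
    return out

def normative_filter(candidates, record, keep):
    """Illustrative 'novelty' selection: admit the candidates whose
    bigrams are least represented in the record so far. A crude
    stand-in for quality/correctness/novelty filters."""
    counts = Counter(bg for seq in record for bg in zip(seq, seq[1:]))
    def novelty(seq):
        return sum(1.0 / (1 + counts[bg]) for bg in zip(seq, seq[1:]))
    return sorted(candidates, key=novelty, reverse=True)[:keep]

# Toy experiment: distinct bigram types after 20 generations, with and
# without the filter (over-generate 2x, then select the novel half).
corpus = [[random.randrange(8) for _ in range(12)] for _ in range(50)]
drifted, selected = corpus, corpus
for _ in range(20):
    drifted = resample(drifted, 50, 12)
    selected = normative_filter(resample(selected, 100, 12), selected, 50)
def distinct(cs):
    return len({bg for s in cs for bg in zip(s, s[1:])})
print("unfiltered:", distinct(drifted), "normative:", distinct(selected))
```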
Circularity Check
No significant circularity detected
Full rationale
The derivation begins from an explicit variable-order n-gram agent model and proceeds to exact characterizations of stable distributions in the infinite-corpus limit. Drift is defined as unfiltered reuse and selection as normative filtering; both are introduced as modeling choices rather than derived from the target quantities. No equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The central bound on divergence under normative publication follows mathematically from the stated transition rules and selection functions without tautological redefinition of inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- domain assumption: Variable-order n-gram agents sufficiently model LLM text generation behavior.
- standard math: The infinite-corpus limit yields well-defined stable distributions under unfiltered reuse.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Theorem 1(b): complete characterisation of fixed points ... convex polytope ... circulations on the de Bruijn graph B(n−1, s) ... dimension s^{n−1}(s−1)
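One way to make the quoted dimension plausible (our own counting, under the assumption that the polytope is the set of probability-normalised circulations; not verified against the paper):

```latex
% B(n-1, s): vertices are the s^{n-1} (n-1)-grams, edges the s^n n-grams.
% On a connected digraph, circulations form a space of dimension |E| - |V| + 1;
% normalising total mass to 1 removes one further degree of freedom:
\dim = s^{n} - s^{n-1} + 1 - 1 = s^{n-1}(s - 1).
```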
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Theorem 2(b): KL divergence ... bounded above by L log₂ s bits ... normative selection is self-sustaining
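The L log₂ s figure matches a generic information-theoretic ceiling, which may be where the bound comes from (an assumption on our part): for strings of length L over an s-symbol alphabet, the KL divergence of any distribution P from the uniform distribution U on the s^L strings satisfies

```latex
D_{\mathrm{KL}}(P \,\|\, U)
  = \sum_{x} P(x) \log_2 \frac{P(x)}{s^{-L}}
  = L \log_2 s - H(P)
  \le L \log_2 s \ \text{bits}.
```

Whether the paper's optimal bound is this ceiling, or a tighter divergence measured from the shallow equilibrium rather than from uniform, cannot be determined from the excerpt.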
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.