pith. machine review for the scientific record.

arxiv: 2604.08554 · v1 · submitted 2026-03-15 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean theorem

Drift and selection in LLM text ecosystems

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: LLM text generation · recursive corpora · drift and selection · n-gram models · model collapse · public text record · normative selection · text ecosystems

The pith

Normative selection sustains deeper structure in recursive LLM text generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an exactly solvable framework using variable-order n-gram agents to model how generated text re-enters the public corpus and shapes future outputs. It separates drift, where unfiltered reuse erases rare forms and drives the corpus toward shallow equilibria with no benefit from longer contexts, from selection, where filters based on quality or novelty determine persistence. Under neutral reflection of current statistics the system converges to minimal complexity, but normative publication prevents this collapse. The authors derive an optimal upper bound on how far the resulting distributions can diverge from those shallow states. The result identifies conditions under which recursive publication compresses public text versus conditions under which selective filtering maintains richer structure.

Core claim

In the variable-order n-gram model of recursive text generation, unfiltered drift produces stable distributions exactly characterized by the progressive loss of rare forms and higher-order dependencies. When publication applies normative filters that reward quality, correctness, or novelty, deeper dependencies persist and the divergence from the shallow equilibrium is bounded from above by an optimal constant that the authors derive.

What carries the argument

Variable-order n-gram agents, which generate text by conditioning predictions on contexts of varying lengths and allow exact computation of stable distributions under drift and selection.
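
To make the machinery concrete, here is a minimal, hedged sketch of the recursive loop with a fixed-order trigram agent rather than the paper's variable-order agents. The seed corpus, the toy novelty filter, full replacement each generation, and the chunked publication rule are all assumptions of this illustration, not the paper's protocol.

```python
# Minimal sketch of the recursive publication loop with a fixed-order trigram
# agent. "Drift" refits on an unfiltered rollout; "selection" keeps only chunks
# accepted by a toy novelty filter. The paper's agents are variable-order and
# are characterised exactly in the infinite-corpus limit; everything concrete
# below (corpus, chunking, filter) is an assumption of this sketch.
import random
from collections import Counter, defaultdict

def fit_trigram(tokens):
    """Count continuations for every length-2 context."""
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model

def sample(model, length, rng):
    """Roll out tokens from the fitted counts."""
    out = list(rng.choice(list(model)))
    while len(out) < length:
        counts = model.get(tuple(out[-2:]))
        if not counts:                       # dead end: restart from a seen context
            out.extend(rng.choice(list(model)))
            continue
        words, weights = zip(*counts.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

def trigram_types(tokens):
    return len(set(zip(tokens, tokens[1:], tokens[2:])))

def publish(tokens, rng, accept=None, chunk_len=50, max_tries=20):
    """One round: generate, optionally filter what enters the record, refit next round."""
    model = fit_trigram(tokens)
    if accept is None:                       # drift: unfiltered republication
        return sample(model, len(tokens), rng)
    published = []
    while len(published) < len(tokens):      # selection: only accepted chunks persist
        for _ in range(max_tries):
            chunk = sample(model, chunk_len, rng)
            if accept(chunk, tokens):
                break
        published.extend(chunk)              # fall back to the last draw if none accepted
    return published

def rare_trigram_filter(chunk, corpus):
    """Toy normative filter: accept chunks that republish a currently-rare trigram."""
    counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    cutoff = sorted(counts.values())[len(counts) // 4]
    rare = {g for g, c in counts.items() if c <= cutoff}
    return any(g in rare for g in zip(chunk, chunk[1:], chunk[2:]))

if __name__ == "__main__":
    seed = ("the dog chased the cat and the cat watched the mouse "
            "while the old dog slept by the warm fire").split() * 40
    for label, filt in [("drift", None), ("selection", rare_trigram_filter)]:
        rng, tokens = random.Random(0), list(seed)
        for _ in range(5):
            tokens = publish(tokens, rng, filt)
        print(f"{label}: trigram-type retention after 5 generations = "
              f"{trigram_types(tokens) / trigram_types(seed):.3f}")
```

The point of the toy is only the shape of the loop: in the drift branch, rare trigrams that happen to be missed in one rollout are gone from every later generation, while the selection branch re-injects pressure to keep them.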

If this is right

  • Unfiltered reuse of generated text drives the public corpus to shallow equilibria where additional context provides no predictive gain (see the entropy sketch after this list).
  • Normative filters that reward quality or novelty prevent convergence to those shallow states.
  • The divergence from shallow equilibria under normative selection is bounded by an optimal constant.
  • The framework directly informs how training corpora for future models should be filtered to preserve or compress linguistic structure.
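
One way to make "no predictive gain from additional context" operational is to compare plug-in conditional entropies at increasing context lengths. The sketch below assumes a whitespace-tokenised corpus file (a hypothetical corpus.txt) and ignores finite-sample bias, so it illustrates the quantity rather than the paper's exact criterion.

```python
# Plug-in conditional entropy H(next token | k preceding tokens), in bits.
# A shallow equilibrium in the paper's sense would show essentially no drop
# as k grows. The file name and tokenisation are assumptions of this sketch,
# and unsmoothed plug-in estimates are biased on small corpora.
import math
from collections import Counter

def conditional_entropy(tokens, k):
    joint = Counter(tuple(tokens[i - k:i + 1]) for i in range(k, len(tokens)))
    context = Counter(tuple(tokens[i - k:i]) for i in range(k, len(tokens)))
    total = sum(joint.values())
    return -sum((c / total) * math.log2(c / context[g[:k]]) for g, c in joint.items())

if __name__ == "__main__":
    tokens = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical corpus
    for k in (0, 1, 2, 3):
        print(f"H(next | {k}-token context) = {conditional_entropy(tokens, k):.3f} bits")
```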

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated self-training loops on uncurated LLM outputs are likely to reduce linguistic diversity over time.
  • Corpus builders could apply explicit quality or novelty thresholds to counteract drift and retain richer training material.
  • The same drift-selection distinction may apply to recursive generation in code, images, or other modalities.
  • Direct tests could track changes in n-gram entropy across generations of real LLM output (a minimal sketch follows this list).
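
A minimal sketch of such a tracking test, assuming per-generation output files named gen_0.txt, gen_1.txt, … (hypothetical names) and whitespace tokenisation. It reports trigram-type counts and plug-in trigram entropy per generation, which unfiltered drift would be expected to push downward.

```python
# Track trigram diversity across successive generations of generated text.
# File names, tokenisation, and the choice of trigram order are assumptions
# of this sketch, not the paper's measurement protocol.
import glob
import math
from collections import Counter

def trigram_entropy(tokens):
    """Plug-in Shannon entropy (bits) of the empirical trigram distribution."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    for path in sorted(glob.glob("gen_*.txt")):        # hypothetical per-generation corpora
        tokens = open(path, encoding="utf-8").read().split()
        types = len(set(zip(tokens, tokens[1:], tokens[2:])))
        print(f"{path}: {types} trigram types, entropy {trigram_entropy(tokens):.3f} bits")
```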

Load-bearing premise

Variable-order n-gram agents provide a sufficient model of LLM text generation to capture the essential dynamics of drift and selection in the recursive public corpus.

What would settle it

An empirical measurement of higher-order n-gram frequencies in successively generated LLM corpora, showing either convergence to the exact shallow equilibrium under unfiltered reuse or persistence of a nonzero divergence from that equilibrium, within the derived bound, under normative selection.
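
As a hedged sketch of one such measurement, loosely modelled on the KL diagnostic visible in the figures: fit a lower-order model to a corpus, roll it out, and compare r-gram statistics. The orders (r = 5, n = 3), the rollout length, and the probability floor for unseen r-grams are assumptions of this sketch, not the paper's exact protocol.

```python
# Compare a corpus's 5-gram statistics with a rollout from the trigram model it
# induces. Near-zero KL suggests a shallow (trigram-explainable) corpus; a
# persistent positive gap indicates deeper structure. Orders, rollout length,
# and the floor on unseen probabilities are assumptions of this sketch.
import math
import random
from collections import Counter, defaultdict

def ngram_dist(tokens, r):
    counts = Counter(zip(*(tokens[i:] for i in range(r))))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def fit_trigram(tokens):
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model

def rollout(model, length, rng):
    out = list(rng.choice(list(model)))
    while len(out) < length:
        counts = model.get(tuple(out[-2:]))
        if not counts:                       # dead end: restart from a seen context
            out.extend(rng.choice(list(model)))
            continue
        words, weights = zip(*counts.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

def kl_gap(tokens, r=5, rollout_len=200_000, floor=1e-9, seed=0):
    """KL( corpus r-grams || r-grams of a rollout from the induced trigram model ), in bits."""
    rng = random.Random(seed)
    p = ngram_dist(tokens, r)
    q = ngram_dist(rollout(fit_trigram(tokens), rollout_len, rng), r)
    return sum(pg * math.log2(pg / max(q.get(g, 0.0), floor)) for g, pg in p.items())

if __name__ == "__main__":
    tokens = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical corpus
    print(f"KL gap (5-gram vs trigram rollout): {kl_gap(tokens):.4f} bits")
```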

Figures

Figures reproduced from arXiv: 2604.08554 by Søren Riis.

Figure 1
Figure 1: Recursive text as drift plus selection. A finite sample of the environment is used to fit a short-context generator. The generator produces synthetic text. What re-enters the environment can be unfiltered (drift) or filtered by a success criterion (selection). view at source ↗
Figure 2
Figure 2: Shows the distinction in a matched exact experiment. In the descriptive recursion, the KL divergence between the corpus r-gram distribution and the rollout from its induced continuation law is tracked across generations. [Panels: "Strong KL gap" (KL divergence, log scale) and "Strong L1 gap" (L1 distance) versus generation, descriptive versus normative.] view at source ↗
Figure 3
Figure 3: Recursive resampling concentrates support; filtering redirects it in the Conan Doyle corpus. All panels are computed from the recursive loop run on a trigram model fitted to the public-domain Arthur Conan Doyle fiction corpus used for the main-text illustration. Under neutral recursion, finite resampling erodes weakly supported higher-order structure and pushes the generator towards more generic continuations… view at source ↗
Figure 4
Figure 4: Vocabulary retention declines under recursive resampling. Across Doyle, Austen, and Darwin, stronger replacement fractions produce stronger contraction. Full replacement (α = 1) yields the steepest decline by generation 12. [Panels: trigram-type retention versus generation for Conan Doyle, Jane Austen, and Charles Darwin at α = 0.25, 0.5, 1.0.] view at source ↗
Figure 5
Figure 5: High-order support contracts faster than vocabulary. Trigram-type retention falls more sharply than vocabulary retention, showing the fragility of higher-order continuation structure. view at source ↗
Figure 6
Figure 6: Active-vocabulary retention above the baseline threshold 1/M0. Even thresholded active vocabulary shows the same monotone concentration pattern as α increases. view at source ↗
Figure 7
Figure 7: The positive side: orthographic standardisation in an Austen pilot (0.5% injection rate). Active variant forms decline after generation 1; canonical share rises; but rare clean-vocabulary retention also falls. view at source ↗
Figure 8
Figure 8: Matched exact Theorem 2 diagnostics. The descriptive recursion drives both the KL divergence and the L1 gap essentially to zero, while the normative recursion settles to a stable nonzero plateau. The entropy panels show that the corpus distribution and induced trigram continuation law can both stabilise even when the KL divergence remains positive. view at source ↗
Figure 9
Figure 9: The stable normative gap has visible structure. Left: the largest final 5-gram probability mismatches between the corpus distribution and the rollout from its induced trigram continuation law. Right: the prefixes contributing most to the total L1 gap. The right panel visualises the prefix component only; in this genuinely 3-deep target, the suffix-conditional component is also nonzero. view at source ↗
read the original abstract

The public text record -- the material from which both people and AI systems now learn -- is increasingly shaped by its own outputs. Generated text enters the public record, later agents learn from it, and the cycle repeats. Here we develop an exactly solvable mathematical framework for this recursive process, based on variable-order $n$-gram agents, and separate two forces acting on the public corpus. The first is drift: unfiltered reuse progressively removes rare forms, and in the infinite-corpus limit we characterise the stable distributions exactly. The second is selection: publication, ranking and verification filter what enters the record, and the outcome depends on what is selected. When publication merely reflects the statistical status quo, the corpus converges to a shallow state in which further lookahead brings no benefit. When publication is normative -- rewarding quality, correctness or novelty -- deeper structure persists, and we establish an optimal upper bound on the resulting divergence from shallow equilibria. The framework therefore identifies when recursive publication compresses public text and when selective filtering sustains richer structure, with implications for the design of AI training corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper develops an exactly solvable mathematical framework using variable-order n-gram agents to analyze the recursive dynamics of the public text corpus influenced by LLM-generated outputs. It distinguishes drift, which leads to the loss of rare forms with exact characterizations of stable distributions in the infinite-corpus limit, from selection effects driven by publication, ranking, and verification. The key result is that normative selection—rewarding quality, correctness, or novelty—maintains deeper structure, establishing an optimal upper bound on divergence from shallow equilibria, while non-normative publication leads to compression.

Significance. Should the derivations prove robust, this framework provides a rigorous, parameter-free tool for understanding how self-referential text generation affects linguistic diversity and complexity. The exact solvability and the clean separation of forces offer clear implications for designing AI training corpora to preserve richness; the machine-checked or exact nature of the stable distributions is a particular strength.

major comments (1)
  1. [Model and derivation sections (e.g., §2–§4)] The assumption that variable-order n-gram agents sufficiently model LLM text generation to capture essential drift and selection dynamics is load-bearing for the central claim. The exact stable distributions are derived only for the n-gram case, but no demonstration is provided that the optimal upper bound on divergence survives when the generator is replaced by a context-aware model like a transformer, which can maintain rare forms and semantic novelty through long-range attention.
minor comments (2)
  1. [Abstract] The abstract asserts 'exact solvability' and 'optimal upper bound' without referencing specific equations or sections, which could be clarified for readers.
  2. [Throughout] Ensure all notation for n-gram orders and selection functions is defined before use to improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for recognizing the value of the exact solvability and the separation of drift from selection. We respond to the single major comment below.

read point-by-point responses
  1. Referee: [Model and derivation sections (e.g., §2–§4)] The assumption that variable-order n-gram agents sufficiently model LLM text generation to capture essential drift and selection dynamics is load-bearing for the central claim. The exact stable distributions are derived only for the n-gram case, but no demonstration is provided that the optimal upper bound on divergence survives when the generator is replaced by a context-aware model like a transformer, which can maintain rare forms and semantic novelty through long-range attention.

    Authors: The manuscript develops an exactly solvable framework using variable-order n-gram agents precisely because this class permits closed-form characterization of the stable distributions under pure drift (see §3 and the infinite-corpus limit derivations). The central claim is therefore scoped to this model: drift produces a unique shallow equilibrium, while normative selection imposes an optimal upper bound on divergence from that equilibrium. We do not assert that the quantitative bound is identical for transformers. However, the qualitative mechanism—preferential reinforcement of high-probability local patterns leading to loss of rare forms—remains operative in any statistical generator trained on its own outputs, even when long-range attention is present. Transformers can reduce but not eliminate this drift without external normative filters. We will add a short limitations paragraph in §5 that (i) reiterates the modeling choice, (ii) notes that attention-based models may alter the rate of drift, and (iii) states that the qualitative distinction between non-normative compression and normative preservation of structure is expected to generalize. This constitutes a partial revision.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation begins from an explicit variable-order n-gram agent model and proceeds to exact characterizations of stable distributions in the infinite-corpus limit. Drift is defined as unfiltered reuse and selection as normative filtering; both are introduced as modeling choices rather than derived from the target quantities. No equation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The central bound on divergence under normative publication follows mathematically from the stated transition rules and selection functions without tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The model rests on the domain assumption that variable-order n-gram agents capture LLM dynamics and on the mathematical limit of an infinite corpus for characterizing stable distributions.

axioms (2)
  • domain assumption: Variable-order n-gram agents sufficiently model LLM text generation behavior.
    Stated as the basis of the framework in the abstract.
  • standard math: The infinite-corpus limit yields well-defined stable distributions under unfiltered reuse.
    Invoked to characterize drift exactly.

pith-pipeline@v0.9.0 · 5474 in / 1282 out tokens · 44166 ms · 2026-05-15T11:46:47.531068+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors
