The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson · 2023 · cs.LG · arXiv 2305.17493

22 Pith papers cite this work. Polarity classification is still indexing.

abstract

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
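The tail-loss mechanism described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's experimental setup: it assumes a single one-dimensional Gaussian (rather than a GMM, VAE, or LLM), a hypothetical helper named `recursive_gaussian_fit`, and arbitrary choices of 10 samples per generation and 200 generations. Each generation is fitted only to samples drawn from the previous generation's model, so finite-sample estimation error compounds and the fitted spread tends to shrink toward zero.

```python
import random
import statistics


def recursive_gaussian_fit(generations=200, samples_per_gen=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Hypothetical helper for illustration only; all parameters are assumptions.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" data distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # Draw a finite synthetic dataset from the current model...
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        # ...and fit the next model on that synthetic data alone
        # (maximum-likelihood mean and population standard deviation).
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append((mu, sigma))
    return history


history = recursive_gaussian_fit()
for gen in (0, 10, 50, 100, 200):
    mu, sigma = history[gen]
    print(f"generation {gen:3d}: mean={mu:+.4f}  std={sigma:.3e}")
```

With so few samples per generation, the fitted standard deviation typically falls by many orders of magnitude over the run, which is a simplified picture of the distribution's tails disappearing. Larger sample sizes slow the effect down but, in this toy setting, do not remove the downward drift.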

citation-role summary

background: 2 · baseline: 1

citation-polarity summary

background: 2 · baseline: 1

representative citing papers

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

When Does Model Collapse Occur in Structured Interactive Learning?

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic results for M-estimators.

Base Models Look Human To AI Detectors

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.

Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.

RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

cs.CR · 2026-04-13 · unverdicted · novelty 7.0

RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.

Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

cs.CL · 2026-03-18 · unverdicted · novelty 7.0

A two-stage synthetic data generation method creates the CommonSyn dataset, allowing LLMs fine-tuned on it to produce more diverse and higher-quality commonsense responses than vanilla or human-data-trained models.

EmbGen: Teaching with Reassembled Corpora

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.

Annotations Mitigate Post-Training Mode Collapse

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

cs.CV · 2026-04-04 · unverdicted · novelty 6.0

CSRS improves MLLM self-evolution stability by using retracing mechanisms and softened continuous rewards instead of majority voting, reaching SOTA on geometric reasoning benchmarks like MathVision.

Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

cs.AI · 2025-06-10 · unverdicted · novelty 6.0

LLM use for essay writing correlates with reduced brain network connectivity, lower self-reported ownership, and poorer recall of one's own content compared to unaided or search-based writing.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

cs.CL · 2024-12-25 · unverdicted · novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

Scaling Synthetic Data Creation with 1,000,000,000 Personas

cs.CL · 2024-06-28 · unverdicted · novelty 6.0

A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

Reinforced Self-Training (ReST) for Language Modeling

cs.CL · 2023-08-17 · unverdicted · novelty 6.0

ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

Textbooks Are All You Need

cs.CL · 2023-06-20 · unverdicted · novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

AgentSim: A Platform for Verifiable Agent-Trace Simulation

cs.IR · 2026-04-29 · unverdicted · novelty 5.0

AgentSim creates and releases the Agent-Trace Corpus of over 103,000 verifiable reasoning steps across three IR benchmarks with claimed 100% grounding on substantive answers.

Position: No Retroactive Cure for Infringement during Training

cs.CR · 2026-04-20 · unverdicted · novelty 5.0

Post-hoc mitigation cannot retroactively cure infringement that occurred during unauthorized data ingestion and training because liability attaches to data lineage and retained expressive value in model weights.

Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs

cs.CL · 2025-07-05 · unverdicted · novelty 4.0

Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.

Content Platform GenAI Regulation via Compensation

cs.CY · 2026-03-12 · unverdicted · novelty 3.0

A compensation-based incentive scheme for human creators on content platforms can increase high-value original content, reduce GenAI data pollution, and raise platform profits without needing AI detectors.

How to Model AI Agents as Personas?: Applying the Persona Ecosystem Playground to 41,300 Posts on Moltbook for Behavioral Insights

cs.HC · 2026-03-03 · unverdicted · novelty 3.0

Researchers clustered 41,300 Moltbook posts from AI agents with k-means and retrieval-augmented generation to produce validated personas that represent behavioral diversity in agent populations.

Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices

cs.DC · 2025-03-11 · unverdicted · novelty 2.0

Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

cs.CL · 2026-05-22

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

cs.LG · 2026-05-08

citing papers explorer

Showing 11 of 11 citing papers after filters.

Evaluating Very Long-Term Conversational Memory of LLM Agents
cs.CL · 2024-02-27 · unverdicted · none · ref 41 · internal anchor

Base Models Look Human To AI Detectors
cs.CL · 2026-05-19 · unverdicted · none · ref 13 · internal anchor

Synthetic Data Generation for Training Diversified Commonsense Reasoning Models
cs.CL · 2026-03-18 · unverdicted · none · ref 3 · internal anchor

EmbGen: Teaching with Reassembled Corpora
cs.CL · 2026-05-19 · unverdicted · none · ref 25 · internal anchor

Annotations Mitigate Post-Training Mode Collapse
cs.CL · 2026-05-11 · unverdicted · none · ref 47 · internal anchor

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL · 2024-12-25 · unverdicted · none · ref 66 · internal anchor

Scaling Synthetic Data Creation with 1,000,000,000 Personas
cs.CL · 2024-06-28 · unverdicted · none · ref 21 · internal anchor

Reinforced Self-Training (ReST) for Language Modeling
cs.CL · 2023-08-17 · unverdicted · none · ref 22 · internal anchor

Textbooks Are All You Need
cs.CL · 2023-06-20 · unverdicted · none · ref 27 · internal anchor

Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs
cs.CL · 2025-07-05 · unverdicted · none · ref 48 · internal anchor

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
cs.CL · 2026-05-22 · unreviewed · ref 31 · internal anchor
