Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
hub
The Curse of Recursion: Training on Generated Data Makes Models Forget
22 Pith papers cite this work. Polarity classification is still indexing.
abstract
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic results for M-estimators.
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
RLSpoofer trains a 4B model on 100 watermarked paraphrase pairs to spoof PF watermarks at 62% success rate, far exceeding baselines trained on up to 10,000 samples.
A two-stage synthetic data generation method creates the CommonSyn dataset, allowing LLMs fine-tuned on it to produce more diverse and higher-quality commonsense responses than vanilla or human-data-trained models.
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
CSRS improves MLLM self-evolution stability by using retracing mechanisms and softened continuous rewards instead of majority voting, reaching SOTA on geometric reasoning benchmarks like MathVision.
LLM use for essay writing correlates with reduced brain network connectivity, lower self-reported ownership, and poorer recall of one's own content compared to unaided or search-based writing.
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
AgentSim creates and releases the Agent-Trace Corpus of over 103,000 verifiable reasoning steps across three IR benchmarks with claimed 100% grounding on substantive answers.
Post-hoc mitigation cannot retroactively cure infringement that occurred during unauthorized data ingestion and training because liability attaches to data lineage and retained expressive value in model weights.
Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.
A compensation-based incentive scheme for human creators on content platforms can increase high-value original content, reduce GenAI data pollution, and raise platform profits without needing AI detectors.
Researchers clustered 41,300 Moltbook posts from AI agents with k-means and retrieval-augmented generation to produce validated personas that represent behavioral diversity in agent populations.
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
citing papers explorer
-
Evaluating Very Long-Term Conversational Memory of LLM Agents
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
-
Base Models Look Human To AI Detectors
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
-
Synthetic Data Generation for Training Diversified Commonsense Reasoning Models
A two-stage synthetic data generation method creates the CommonSyn dataset, allowing LLMs fine-tuned on it to produce more diverse and higher-quality commonsense responses than vanilla or human-data-trained models.
-
EmbGen: Teaching with Reassembled Corpora
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
-
Annotations Mitigate Post-Training Mode Collapse
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
Reinforced Self-Training (ReST) for Language Modeling
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs
Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.
- When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming