arxiv: 2107.06499 · v2 · pith:KGI2ZLF7new · submitted 2021-07-14 · 💻 cs.CL · cs.LG

Deduplicating Training Data Makes Language Models Better

Katherine Lee , Daphne Ippolito , Andrew Nystrom , Chiyuan Zhang , Douglas Eck , Chris Callison-Burch , Nicholas Carlini This is my paper

classification 💻 cs.CL cs.LG

keywords datasetslanguagemodelstrainingbetterdatadeduplicationtimes

0 comments

read the original abstract

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data
cs.LG 2025-09 unverdicted novelty 7.0

Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI 2024-06 conditional novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
Improving language models by retrieving from trillions of tokens
cs.CL 2021-12 unverdicted novelty 7.0

RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
Multitask Prompted Training Enables Zero-Shot Task Generalization
cs.LG 2021-10 conditional novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
cs.CV 2024-03 conditional novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
The False Promise of Imitating Proprietary LLMs
cs.CL 2023-05 conditional novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
cs.LG 2023-03 unverdicted novelty 6.0

SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Scaling Laws and Interpretability of Learning from Repeated Data
cs.LG 2022-05 accept novelty 6.0

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions
cs.HC 2026-04 unverdicted novelty 5.0

Warned AI-assisted writers had their documents selected as human 54.13% of the time by judges versus 45.87% for unwarned writers, despite no measurable differences in text features.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
cs.AI 2023-08 accept novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
cs.CL 2026-05 unverdicted novelty 4.0

Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
cs.CL 2026-05 unverdicted novelty 4.0

Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
cs.CL 2025-09 unverdicted novelty 3.0

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.