Essential-Web v1.0: 24T tokens of organized web data
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
Citing papers explorer
- Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora
  A multi-dimensional taxonomy filtering approach recovers high-performing data from deprioritized web corpora, with filtered low-tier subsets outperforming unfiltered top-tier data on reasoning and coding benchmarks.
- Scaling Laws for Mixture Pretraining Under Data Constraints
  Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.