pith. machine review for the scientific record.

arxiv: 2506.01732 · v2 · submitted 2025-06-02 · 💻 cs.CL

Recognition: unknown

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Authors on Pith: no claims yet
classification: 💻 cs.CL
keywords: data, common corpus, large models, dataset, open, pre-training
abstract

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from high-resource European languages to low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. In this paper, we present the detailed provenance of data assembly and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pre-training. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.
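The abstract's central curation criterion — keeping only uncopyrighted or openly licensed documents — can be sketched as a simple license filter. This is a hypothetical illustration, not the authors' actual pipeline; the field name `license` and the set of accepted license identifiers are assumptions for the example.

```python
# Hypothetical sketch of license-based filtering, as described in the
# Common Corpus abstract: keep only documents that are public domain
# or carry an open license. Field names and license IDs are illustrative.

OPEN_LICENSES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-4.0",
    "mit",
    "apache-2.0",
}

def is_open(doc: dict) -> bool:
    """Return True if the document's license permits open reuse."""
    return doc.get("license", "").lower() in OPEN_LICENSES

docs = [
    {"text": "An 1850 newspaper article.", "license": "public-domain"},
    {"text": "A modern novel excerpt.", "license": "all-rights-reserved"},
    {"text": "Permissively licensed code.", "license": "MIT"},
]

# Only the public-domain article and the MIT-licensed code survive.
open_docs = [d for d in docs if is_open(d)]
print(len(open_docs))  # 2
```

In practice such a filter would run over metadata attached during collection; documents with missing or ambiguous license fields would be dropped conservatively, which is consistent with the paper's "either uncopyrighted or under open licenses" constraint.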

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

    cs.CL 2026-04 unverdicted novelty 8.0

    RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.

  2. The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings

    cs.HC 2026-04 unverdicted novelty 4.0

    Advanced LLMs improve EFL writing scores and diversity for lower-proficiency students but correlate with lower expert ratings on deep coherence, acting more as crutches than scaffolds.