arXiv: 2305.16264 · v5 · pith:7RK4LJX5new · submitted 2023-05-25 · 💻 cs.CL · cs.AI · cs.LG

Scaling Data-Constrained Language Models

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

Pith reviewed 2026-05-18 01:30 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords data-constrained scaling · language models · repeated data · scaling laws · compute optimality · data repetition

The pith

Repeating training data up to four times has little effect on language model loss for a given compute budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how language models scale when unique training data runs short. Hundreds of runs, with models up to 9 billion parameters and budgets up to 900 billion tokens, show that repeating the same data for up to four epochs produces loss nearly identical to training on fresh data. With more repetition, the benefit of extra compute shrinks and eventually decays to zero. The authors propose and validate a scaling law for compute-optimal training that accounts for the declining value of repeated tokens and excess parameters. This matters because the total supply of available text may soon cap further scaling.

Core claim

With constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. A scaling law for compute optimality is proposed and validated that accounts for the decreasing value of repeated tokens and excess parameters.

What carries the argument

Scaling law for compute optimality that reduces the effective value of repeated tokens and surplus parameters.
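The idea of discounting repeated tokens can be illustrated with a small sketch. The exponential saturation shape, the function name `effective_tokens`, and the constant `r_star=15.0` are assumptions chosen for illustration; the text above does not state the paper's functional form or fitted values.

```python
import math


def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Discounted token count after training for `epochs` passes.

    Assumed form (illustrative only): each repeated pass is worth less
    than the one before, and the total saturates at
    unique_tokens * (1 + r_star).
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    repeats = epochs - 1  # passes beyond the first
    # Saturating value of the repeated passes, in units of one fresh pass.
    repeat_value = r_star * (1.0 - math.exp(-repeats / r_star))
    return unique_tokens * (1.0 + repeat_value)


# Compare the discounted count with the raw count of tokens seen.
for epochs in (1, 2, 4, 8, 16, 64):
    raw = 100e9 * epochs
    eff = effective_tokens(100e9, epochs)
    print(f"{epochs:>3} epochs: effective/raw = {eff / raw:.2f}")
```

Under this assumed constant, four epochs retain roughly 93% of the raw token count, while the discounted count can never exceed `1 + r_star` times the unique data however many passes are run.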

If this is right

Training runs can reuse the same data up to four epochs with almost no extra loss.
With heavier repetition, the return on additional compute shrinks and eventually decays to zero.
Augmenting the dataset with code or relaxing common filters can partially offset data scarcity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training recipes may shift toward generating fresh synthetic data once repetition costs rise.
Optimal model size may shrink relative to compute when repetition is forced to be high.
Similar repetition limits could appear in other domains that also face finite high-quality data.

Load-bearing premise

The loss patterns measured up to 9 billion parameters and a few epochs of repetition continue unchanged at larger scales and with different data sources.

What would settle it

Train a model at 100 billion parameters on data repeated ten or more times and measure whether final loss follows the proposed scaling law or deviates from its predictions.

Original abstract

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Repeating data up to four epochs barely hurts loss, but the scaling law for when extra compute stops helping is only checked up to 9B models.

The central result is straightforward: with fixed compute and limited unique data, training on up to four epochs of repetition produces loss values close to what you would get with fresh tokens. Past that point the marginal value of more compute drops toward zero. They capture this in a fitted scaling law that treats repeated tokens as having decaying utility and excess parameters as having their own cost. That is the usable takeaway for anyone staring at data ceilings on the next training run.

Referee Report

2 major / 2 minor

Summary. The paper investigates scaling of language models under data constraints by running 400 experiments varying data repetition and compute budget, up to 9B parameters and 900B tokens. It claims that for fixed compute, up to 4 epochs of repeated data yields negligible loss change versus unique data, but further repetition causes the value of added compute to decay to zero. The authors propose and empirically validate a scaling law for compute optimality that incorporates the diminishing returns of repeated tokens and excess parameters, and test mitigations such as adding code data or changing filters. Models and datasets are released publicly.

Significance. If the central empirical findings and scaling law hold beyond the tested regime, the work is significant because it directly addresses the emerging bottleneck of high-quality text data for frontier-scale training. The large experimental grid (400 runs) and public release of models/datasets provide a valuable resource for the community and strengthen the empirical basis for the proposed law relating loss to repetition and compute.

major comments (2)

[Experiments and scaling law sections] The 4-epoch threshold and the claim that the value of additional compute decays to zero are derived from fits on the same experimental grid, up to 9B parameters. No separate validation set or out-of-distribution test at larger scales is reported, yet the extrapolation to future frontier runs rests on exactly that evidence.
[Proposed scaling law] Around the equation for compute optimality, the repetition-value decay coefficient is introduced as a fitted parameter. The manuscript should clarify whether this coefficient is dataset-specific or intended to be universal, as this directly affects the claimed generality of the law across data distributions.

minor comments (2)

[Abstract and results] The abstract states 'negligible changes to loss'; provide quantitative thresholds or statistical tests used to define 'negligible' in the main text or appendix.
[Figures] Figure captions and axis labels for the scaling plots should explicitly note the range of repetition factors and model sizes tested to aid quick assessment of the empirical support.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript where appropriate.

Referee: [Experiments and scaling law sections] The 4-epoch threshold and the claim that the value of additional compute decays to zero are derived from fits on the same experimental grid, up to 9B parameters. No separate validation set or out-of-distribution test at larger scales is reported, yet the extrapolation to future frontier runs rests on exactly that evidence.

Authors: The referee is correct that the 4-epoch threshold and decay-to-zero behavior are identified from fits to our full grid of 400 experiments (up to 9B parameters and 900B tokens). We did not hold out a separate validation set or conduct tests at larger scales. In the revised manuscript we will add a cross-validation analysis (fitting on random subsets of the grid and evaluating predictive accuracy on held-out runs) to demonstrate robustness of the fitted parameters within the tested regime. We will also expand the limitations section to explicitly discuss the risks of extrapolation beyond 9B parameters. However, we lack the resources to run out-of-distribution experiments at frontier scales.

revision: partial

Referee: [Proposed scaling law] Around the equation for compute optimality, the repetition-value decay coefficient is introduced as a fitted parameter. The manuscript should clarify whether this coefficient is dataset-specific or intended to be universal, as this directly affects the claimed generality of the law across data distributions.

Authors: We appreciate the request for clarification. The repetition-value decay coefficient is a fitted parameter obtained from our C4-based experiments and is not presented as a universal constant. In the revised manuscript we will explicitly state that the coefficient is dataset-dependent and should be re-estimated for new data distributions or quality levels. We will also include a short analysis applying the law to our code-augmentation experiments to illustrate its behavior under modest changes in data composition.

revision: yes
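The held-out check described in the first response (fit on part of the grid, score on the rest) can be sketched in a few lines. This is a minimal illustration, not the authors' analysis: a plain power law stands in for the paper's scaling law, the runs are synthetic and noise-free, and `fit_power_law` and `k_fold_error` are hypothetical helper names.

```python
import math
import random


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def k_fold_error(xs, ys, k=5, seed=0):
    """Mean absolute relative error on held-out points across k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit only on the training runs, then score the held-out runs.
        a, b = fit_power_law([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            pred = a * xs[i] ** (-b)
            errors.append(abs(pred - ys[i]) / ys[i])
    return sum(errors) / len(errors)


# Synthetic, noise-free "runs" (NOT data from the paper).
tokens = [1e9 * 2 ** i for i in range(10)]
losses = [20.0 * t ** (-0.1) for t in tokens]
print("held-out relative error:", k_fold_error(tokens, losses))
```

On real runs the held-out error would not be near zero. A gap between in-sample and held-out error is what such a check is meant to expose, and it still says nothing about scales outside the grid.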

standing simulated objections · not resolved

We cannot conduct additional training runs at scales substantially larger than 9B parameters and 900B tokens due to computational resource constraints.

Circularity Check

0 steps flagged

No significant circularity; empirical scaling law fitted to new experimental grid

The paper runs a large suite of new experiments (up to 9B parameters, 900B tokens, varying repetition epochs) and directly observes the effect of data repetition on loss. From these observations it proposes and fits a scaling law for compute optimality. This is standard empirical model-building rather than any reduction of a claimed prediction to prior fitted quantities or self-citations by construction. No equations, uniqueness theorems, or ansatzes are shown to be smuggled in via self-reference; the central claim remains an independent fit to the reported experimental data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on large-scale empirical measurements rather than new theoretical axioms or postulated entities; the scaling law itself contains fitted parameters whose exact count and values are not stated in the abstract.

free parameters (1)

repetition-value decay coefficient
Parameter inside the proposed compute-optimality scaling law that captures how much less useful each additional epoch of repeated data becomes; fitted to the experimental loss curves.

axioms (1)

domain assumption: Loss continues to follow a smooth, predictable function of effective compute even when tokens are repeated.
Invoked when extrapolating the new scaling law beyond the measured repetition range.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 7.0

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Causal inference for social network formation
econ.EM 2026-04 conditional novelty 7.0

Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.

The Art of Scaling Reinforcement Learning Compute for LLMs
cs.LG 2025-10 unverdicted novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...

OLMo: Accelerating the Science of Language Models
cs.CL 2024-02 accept novelty 7.0

OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.

Scalable Extraction of Training Data from (Production) Language Models
cs.LG 2023-11 conditional novelty 7.0

Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models l...

C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

RWKV: Reinventing RNNs for the Transformer Era
cs.CL 2023-05 unverdicted novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
cs.LG 2026-05 conditional novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

Foundation Models for Discovery and Exploration in Chemical Space
physics.chem-ph 2025-10 unverdicted novelty 6.0

MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.

DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
cs.CL 2024-04 accept novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

The Falcon Series of Open Language Models
cs.CL 2023-11 conditional novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Textbooks Are All You Need
cs.CL 2023-06 unverdicted novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
cs.CL 2026-04 unverdicted novelty 5.0

New dictionary-derived datasets enable fine-tuned LLMs to act as language tutors for ten low-resource African languages, with SFT plus DPO yielding 1.8-15.5% gains on LLM-as-judge metrics.

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
cs.CV 2026-04 unverdicted novelty 5.0

Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.

A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
cs.LG 2026-04 unverdicted novelty 2.0

A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · cited by 17 Pith papers · 50 internal anchors

[1]

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. 2023. Scaling Laws for Generative Mixed-Modal Language Models. arXiv preprint arXiv:2301.03728

work page arXiv 2023
[2]

Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. 2022. Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312

work page 2022
[3]

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Car- los Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988

work page arXiv 2023
[4]

Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th joint meeting on foundations of software engineering, pages 38–49

work page 2015
[5]

Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V

Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V . Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xi...

work page
[6]

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts

work page
[7]

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. 2021. Explain- ing neural scaling laws. arXiv preprint arXiv:2102.06701

work page arXiv 2021
[8]

Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Colin Cherry, Behnam Neyshabur, and Orhan Firat. 2022. Data scaling laws in NMT: The effect of noise and architecture. In International Conference on Machine Learning, pages 1466–1482. PMLR

work page 2022
[9]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623

work page 2021
[10]

Zhengda Bian, Hongxin Liu, Boxiang Wang, Haichen Huang, Yongbin Li, Chuanrui Wang, Fan Cui, and Yang You. 2021. Colossal-AI: A unified deep learning system for large-scale parallel training. arXiv preprint arXiv:2110.14883

work page arXiv 2021
[11]

Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, and Edward Raf. 2023. Emergent and Predictable Memorization in Large Language Models. arXiv preprint arXiv:2304.11158. 3https://www.lumi-supercomputer.eu/ 10

work page arXiv 2023
[12]

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Rea- soning about Physical Commonsense in Natural Language. In Thirty-Fourth AAAI Conference on Artificial Intelligence

work page 2020
[14]

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv preprint arXiv:2204.06745

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow. If you use this software, please cite it using these metadata, 58

work page 2021
[16]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877– 1901

work page 2020
[17]

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, and Anastasia Shimorina. 2020. The 2020 Bilingual, Bi-Directional WebNLG+ Shared Task Overview and Evaluation Results (WebNLG+ 2020). In Proceedings of the 3rd WebNLG Workshop on Natural Language Generation from the Semantic Web (WebNLG+ 2020), pages...

work page 2020
[19]

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. 2020. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR

work page 2020
[20]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

work page
[22]

Scaling Instruction-Finetuned Language Models

Scaling Instruction-Finetuned Language Models. arXiv preprint arXiv:2210.11416

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL

work page 2019
[24]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wen- zek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoy- anov. 2019. Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116

work page internal anchor Pith review Pith/arXiv arXiv 2019
[26]

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment: First PASCAL Machine 11 Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected...

work page 2006
[27]

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The commitment- bank: Investigating projection in naturally occurring discourse. In proceedings of Sinn und Bedeutung, volume 23, pages 107–124

work page 2019
[28]

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. 2023. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442

work page arXiv 2023
[29]

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR

work page 2022
[31]

Ondˇrej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge. Computer Speech & Language, 59:123–156

work page 2020
[32]

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res, 23:1–40

work page 2021
[33]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy

work page
[34]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation

work page 2021
[36]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462

work page internal anchor Pith review Pith/arXiv arXiv 2020
[37]

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D Dhole, et al. 2021. The gem benchmark: Natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672

work page arXiv 2021
[38]

Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. 2021. Scaling laws for neural machine translation. arXiv preprint arXiv:2109.07740

work page arXiv 2021
[39]

Himanshu Gupta, Saurabh Arjun Sawant, Swaroop Mishra, Mutsumi Nakamura, Arindam Mitra, Santosh Mashetty, and Chitta Baral. 2023. Instruction Tuned Models are Quick Learners. arXiv preprint arXiv:2306.05539

work page arXiv 2023
[40]

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. InProceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics

work page 2011
[41]

Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel Ho. 2022. Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset. Advances in Neural Information Processing Systems, 35:29217– 29234. 12

work page 2022
[42]

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. 2020. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701

work page internal anchor Pith review Pith/arXiv arXiv 2020
[43]

Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, et al. 2022. Scaling Laws and Interpretability of Learning from Repeated Data. arXiv preprint arXiv:2205.10487

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. 2021. Scaling laws for transfer. arXiv preprint arXiv:2102.01293

work page internal anchor Pith review Pith/arXiv arXiv 2021
[45]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al

work page
[46]

Training Compute-Optimal Large Language Models

Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck. 2018. Music transformer. arXiv preprint arXiv:1809.04281

work page internal anchor Pith review Pith/arXiv arXiv 2018
[48]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073–2083

work page 2016
[49]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models

work page 2022
[50]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[51]

Mikhail Khrushchev, Ruslan Vasilev, Alexey Petrov, and Nikolay Zinov. 2022. YaLM 100B

work page 2022
[52]

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[53]

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, et al. 2022. The Stack: 3 TB of permissively licensed source code. arXiv preprint arXiv:2211.15533

work page arXiv 2022
[54]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

work page internal anchor Pith review Pith/arXiv arXiv 2022
[55]

Aran Komatsuzaki. 2019. One epoch is all you need. arXiv preprint arXiv:1906.06669

work page internal anchor Pith review Pith/arXiv arXiv 2019
[56]

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics

work page 2018
[57]

Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLin- gua: A new benchmark dataset for cross-lingual abstractive summarization. arXiv preprint arXiv:2010.03093

work page arXiv 2020
[58]

Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

[59]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating Training Data Makes Language Models Better. arXiv preprint arXiv:2107.06499

[60]

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge... 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161

[62]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110

[63]

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1

[64]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81

[65]

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668

[66]

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, et al. 2019. Choosing transfer languages for cross-lingual learning. arXiv preprint arXiv:1905.12688

[67]

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv preprint arXiv:2301.13688

[68]

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Deb Roy, and Sara Hooker. 2023. The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Att...

[69]

Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, and Daphne Ippolito. 2023. A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

[70]

Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, and Sampo Pyysalo. 2023. FinGPT: Large...

[71]

Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Po-Ssu Huang, and Richard Socher. 2020. Progen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497

[72]

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The Natural Language Decathlon: Multitask Learning as Question Answering. CoRR, abs/1806.08730

[73]

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2021. Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943

[74]

Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51

[76]

Niklas Muennighoff. 2020. Vilio: State-of-the-art visio-linguistic models applied to hateful memes. arXiv preprint arXiv:2012.07788

[77]

Niklas Muennighoff. 2022. SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904

[78]

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv preprint arXiv:2308.07124

[79]

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. MTEB: Massive Text Embedding Benchmark. arXiv preprint arXiv:2210.07316

[80]

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786


Showing first 80 references.