Scaling Data-Constrained Language Models
Pith reviewed 2026-05-18 01:30 UTC · model grok-4.3
The pith
Repeating training data up to four times has little effect on language model loss for a given compute budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. A scaling law for compute optimality is proposed and validated that accounts for the decreasing value of repeated tokens and excess parameters.
What carries the argument
Scaling law for compute optimality that reduces the effective value of repeated tokens and surplus parameters.
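One form such a law can take is written below as an illustrative sketch, not as a transcription of the paper's equation: repeated tokens are discounted by a saturating exponential before they enter a Chinchilla-style loss. The symbols $U_D$ (unique tokens), $R_D$ (repetitions beyond the first epoch), $R_D^{*}$ (fitted decay constant), and the loss constants $E, A, B, \alpha, \beta$ are assumptions introduced for this sketch; $N'$ stands for an analogously discounted parameter count.

```latex
\[
D' = U_D + U_D\,R_D^{*}\left(1 - e^{-R_D / R_D^{*}}\right),
\qquad
L(N', D') = E + \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}}
\]
```

Under this form, $D' \approx U_D(1 + R_D)$ when $R_D \ll R_D^{*}$, so the first few repeated epochs count almost fully, while $D' \to U_D(1 + R_D^{*})$ as $R_D \to \infty$, so further epochs add nothing.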
If this is right
- Training runs can reuse the same data up to four epochs with almost no extra loss.
- Once data is repeated heavily, additional compute eventually yields no further improvement.
- Augmenting the dataset with code or relaxing common filters can partially offset data scarcity.
Where Pith is reading between the lines
- Training recipes may shift toward generating fresh synthetic data once repetition costs rise.
- Optimal model size may shrink relative to compute when repetition is forced to be high.
- Similar repetition limits could appear in other domains that also face finite high-quality data.
Load-bearing premise
The loss patterns measured up to 9 billion parameters and a few epochs of repetition continue unchanged at larger scales and with different data sources.
What would settle it
Train a model at 100 billion parameters on data repeated ten or more times and measure whether final loss follows the proposed scaling law or deviates from its predictions.
Original abstract
The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates scaling of language models under data constraints by running 400 experiments varying data repetition and compute budget, up to 9B parameters and 900B tokens. It claims that for fixed compute, up to 4 epochs of repeated data yields negligible loss change versus unique data, but further repetition causes the value of added compute to decay to zero. The authors propose and empirically validate a scaling law for compute optimality that incorporates the diminishing returns of repeated tokens and excess parameters, and test mitigations such as adding code data or changing filters. Models and datasets are released publicly.
Significance. If the central empirical findings and scaling law hold beyond the tested regime, the work is significant because it directly addresses the emerging bottleneck of high-quality text data for frontier-scale training. The large experimental grid (400 runs) and public release of models/datasets provide a valuable resource for the community and strengthen the empirical basis for the proposed law relating loss to repetition and compute.
major comments (2)
- [Experiments and scaling law sections] The 4-epoch threshold and the claim that the value of additional compute decays to zero are derived from fits on the same experimental grid up to 9B parameters. No separate validation set or out-of-distribution test at larger scales is reported, yet the extrapolation to future frontier runs depends on exactly that evidence.
- [Proposed scaling law] Around the equation for compute optimality, the repetition-value decay coefficient is introduced as a fitted parameter. The manuscript should clarify whether this coefficient is dataset-specific or intended to be universal, as this directly affects the claimed generality of the law for different data distributions.
minor comments (2)
- [Abstract and results] The abstract states 'negligible changes to loss'; provide quantitative thresholds or statistical tests used to define 'negligible' in the main text or appendix.
- [Figures] Figure captions and axis labels for the scaling plots should explicitly note the range of repetition factors and model sizes tested to aid quick assessment of the empirical support.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for major revision. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript where appropriate.
Point-by-point responses
- Referee: [Experiments and scaling law sections] The 4-epoch threshold and the claim that the value of additional compute decays to zero are derived from fits on the same experimental grid up to 9B parameters. No separate validation set or out-of-distribution test at larger scales is reported, yet the extrapolation to future frontier runs depends on exactly that evidence.
  Authors: The referee is correct that the 4-epoch threshold and decay-to-zero behavior are identified from fits to our full grid of 400 experiments (up to 9B parameters and 900B tokens). We did not hold out a separate validation set or conduct tests at larger scales. In the revised manuscript we will add a cross-validation analysis (fitting on random subsets of the grid and evaluating predictive accuracy on held-out runs) to demonstrate robustness of the fitted parameters within the tested regime. We will also expand the limitations section to explicitly discuss the risks of extrapolation beyond 9B parameters. However, we lack the resources to run out-of-distribution experiments at frontier scales.
  Revision: partial
- Referee: [Proposed scaling law] Around the equation for compute optimality, the repetition-value decay coefficient is introduced as a fitted parameter. The manuscript should clarify whether this coefficient is dataset-specific or intended to be universal, as this directly affects the claimed generality of the law for different data distributions.
  Authors: We appreciate the request for clarification. The repetition-value decay coefficient is a fitted parameter obtained from our C4-based experiments and is not presented as a universal constant. In the revised manuscript we will explicitly state that the coefficient is dataset-dependent and should be re-estimated for new data distributions or quality levels. We will also include a short analysis applying the law to our code-augmentation experiments to illustrate its behavior under modest changes in data composition.
  Revision: yes
- We cannot conduct additional training runs at scales substantially larger than 9B parameters and 900B tokens due to computational resource constraints.
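The cross-validation analysis promised in the rebuttal (fit on random subsets of the grid, evaluate on held-out runs) could look roughly like the sketch below. Everything here is hypothetical: the loss constants `E`, `B`, `BETA`, the saturating-exponential discount, the grid search over `r_star`, and the synthetic noise-free runs are stand-ins for illustration, not the authors' data or fitting procedure.

```python
import math
import random

# Hypothetical fixed loss constants; a real fit would estimate these too.
E, B, BETA = 1.7, 400.0, 0.3


def predicted_loss(unique_tokens, epochs, r_star):
    """Loss under an assumed saturating discount on repeated tokens."""
    eff = unique_tokens * (1.0 + r_star * (1.0 - math.exp(-(epochs - 1.0) / r_star)))
    return E + B / eff ** BETA


def fit_r_star(runs, grid):
    """Pick the grid value of r_star with the lowest squared error on `runs`."""
    def sse(r_star):
        return sum((predicted_loss(u, e, r_star) - loss) ** 2 for u, e, loss in runs)
    return min(grid, key=sse)


def k_fold_errors(runs, k, grid, seed=0):
    """Return (fitted r_star, held-out MSE) for each of k folds."""
    shuffled = runs[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    results = []
    for i, held_out in enumerate(folds):
        train = [run for j, fold in enumerate(folds) if j != i for run in fold]
        r_star = fit_r_star(train, grid)
        mse = sum((predicted_loss(u, e, r_star) - loss) ** 2
                  for u, e, loss in held_out) / len(held_out)
        results.append((r_star, mse))
    return results


# Synthetic, noise-free runs generated with r_star = 15 (illustrative only).
runs = [(u, e, predicted_loss(u, e, 15.0))
        for u in (1e9, 1e10, 1e11) for e in (1, 2, 4, 8, 16, 32)]
grid = [float(r) for r in range(1, 41)]
results = k_fold_errors(runs, k=3, grid=grid)

if __name__ == "__main__":
    for r_star, mse in results:
        print(f"fitted r_star={r_star:.1f}, held-out MSE={mse:.2e}")
```

A stable fitted constant and low held-out error across folds would support robustness inside the tested regime only; as the rebuttal concedes, it says nothing about scales beyond the grid.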
Circularity Check
No significant circularity; empirical scaling law fitted to new experimental grid
Full rationale
The paper runs a large suite of new experiments (up to 9B parameters, 900B tokens, varying repetition epochs) and directly observes the effect of data repetition on loss. From these observations it proposes and fits a scaling law for compute optimality. This is standard empirical model-building rather than any reduction of a claimed prediction to prior fitted quantities or self-citations by construction. No equations, uniqueness theorems, or ansatzes are shown to be smuggled in via self-reference; the central claim remains an independent fit to the reported experimental data.
Axiom & Free-Parameter Ledger
free parameters (1)
- repetition-value decay coefficient
axioms (1)
- domain assumption: Loss continues to follow a smooth, predictable function of effective compute even when tokens are repeated.
Forward citations
Cited by 18 Pith papers
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- Causal inference for social network formation
  Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
- The Art of Scaling Reinforcement Learning Compute for LLMs
  A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...
- OLMo: Accelerating the Science of Language Models
  OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.
- Scalable Extraction of Training Data from (Production) Language Models
  Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models l...
- C-Pack: Packed Resources For General Chinese Embeddings
  C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
- RWKV: Reinventing RNNs for the Transformer Era
  RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
- Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
  Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
- Foundation Models for Discovery and Exploration in Chemical Space
  MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
- DataComp-LM: In search of the next generation of training sets for language models
  DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- The Falcon Series of Open Language Models
  Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
- Textbooks Are All You Need
  A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
- AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
  New dictionary-derived datasets enable fine-tuned LLMs to act as language tutors for ten low-resource African languages, with SFT plus DPO yielding 1.8-15.5% gains on LLM-as-judge metrics.
- DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
  Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.
- A Survey of Large Language Models
  This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
- Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
  A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.