L20-Edu-135M: An Auditable Single-GPU Study of Data-Efficient Small Language Modeling
Pith reviewed 2026-06-26 12:03 UTC · model grok-4.3
The pith
A 135M model trained on one GPU with 13 billion tokens reaches 87.1 percent of SmolLM-135M's zero-shot score.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
L20-Edu-135M receives approximately 13B pretraining tokens and obtains a mean score of 0.4150 in a self-run zero-shot six-task harness, which is 87.1% of SmolLM-135M's mean score (0.4767) at 2.17% of SmolLM-135M's nominal token count.
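The two headline percentages are plain ratios, so they can be checked directly. The sketch below takes the scores and the 13B token count from the abstract. The 600B-token figure for SmolLM-135M is an assumption, not a number stated on this page; it is used because it is consistent with the reported 2.17%. The helper name `percent_of` is illustrative only.

```python
# Scores and the 13B token count come from the abstract.
# ASSUMPTION: 600B nominal pretraining tokens for SmolLM-135M (not stated on this page).
L20_EDU_SCORE = 0.4150
L20_EDU_TOKENS = 13e9
SMOLLM_SCORE = 0.4767
SMOLLM2_SCORE = 0.4917
SMOLLM_TOKENS_ASSUMED = 600e9


def percent_of(value, reference):
    """Return value expressed as a percentage of reference."""
    return 100.0 * value / reference


print(f"mean score vs SmolLM-135M:  {percent_of(L20_EDU_SCORE, SMOLLM_SCORE):.1f}%")
print(f"mean score vs SmolLM2-135M: {percent_of(L20_EDU_SCORE, SMOLLM2_SCORE):.1f}%")
print(f"tokens vs SmolLM-135M:      {percent_of(L20_EDU_TOKENS, SMOLLM_TOKENS_ASSUMED):.2f}%")
```

Under these inputs the script prints 87.1%, 84.4%, and 2.17%. As the abstract stresses, these are descriptive ratios of single-run means, not evidence of a scaling relationship.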
What carries the argument
The complete single-GPU pipeline that combines MinHash/LSH near-deduplication, segment deduplication, benchmark-overlap removal, throughput optimization, SFT with weight interpolation, and RLVR on GSM8K.
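For readers unfamiliar with the first stage named above, here is a minimal sketch of MinHash signatures with LSH banding for near-duplicate detection. It is not the paper's implementation: the shingle size, signature length, band count, and similarity threshold are all assumed values chosen for illustration, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (assumed)
BANDS = 16     # LSH bands (assumed); rows per band = NUM_PERM // BANDS
SHINGLE = 5    # word n-gram size (assumed)


def shingles(text, n=SHINGLE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(seed, token):
    """64-bit hash of token; the seed selects one of NUM_PERM hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    sh = shingles(text)
    return tuple(min(seeded_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """Documents sharing an identical band of the signature become candidates."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs, threshold=0.8):
    """Return candidate pairs whose estimated similarity meets the threshold."""
    sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
    return {
        (a, b) for a, b in candidate_pairs(sigs)
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold
    }
```

Banding trades recall for speed: only documents that collide in at least one band are compared, so the quadratic all-pairs comparison is avoided. A production pipeline would normally keep one document per duplicate cluster rather than just listing pairs.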
If this is right
- L20-Edu-135M exceeds several older 100M-160M public baselines under the same harness.
- Direct GRPO-style RLVR lowers GSM8K exact-match accuracy from 1.82% to 1.59% with 192-token completions and to 1.21% with 320-token completions.
- The full pipeline documentation lets others audit, and attempt to reproduce, every data and training decision.
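To put the GSM8K percentages in the list above in concrete terms, the sketch below converts them to problem counts. It assumes the evaluation used the standard 1,319-problem GSM8K test split, which this page does not confirm; the binomial standard error is a rough, added sanity check and not an analysis from the paper.

```python
import math

# ASSUMPTION: evaluation on the standard GSM8K test split of 1,319 problems.
GSM8K_TEST_SIZE = 1319


def implied_correct(accuracy_percent, n=GSM8K_TEST_SIZE):
    """Number of solved problems implied by a reported accuracy."""
    return round(accuracy_percent / 100.0 * n)


def accuracy_percent(correct, n=GSM8K_TEST_SIZE):
    """Accuracy in percent for a given number of solved problems."""
    return 100.0 * correct / n


def binomial_se_percent(correct, n=GSM8K_TEST_SIZE):
    """Binomial standard error of the accuracy, in percentage points."""
    p = correct / n
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


for label, acc in [
    ("before RLVR", 1.82),
    ("after RLVR, 192-token completions", 1.59),
    ("after RLVR, 320-token completions", 1.21),
]:
    k = implied_correct(acc)
    print(f"{label}: {k} problems, {accuracy_percent(k):.2f}% "
          f"(SE about {binomial_se_percent(k):.2f} points)")
```

Under that assumption the reported figures correspond to 24, 21, and 16 solved problems. Differences of a handful of problems are small relative to the sampling noise at this accuracy level, which fits the abstract's caution that these are single-run results.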
Where Pith is reading between the lines
- Data curation and overlap removal may allow small models to retain most capability with far fewer tokens than current scaling recipes assume.
- The observed drop from RLVR in this low-token setting points to possible sensitivity of reinforcement stages to data volume or hyperparameter choices.
- Extending the same audited pipeline to 50-100B tokens would test whether the efficiency advantage persists or narrows.
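One way to see why reinforcement stages could be fragile at this accuracy level is to look at the group-normalized advantage commonly used in GRPO. The sketch below is a generic version of that computation, not the paper's exact variant. The group size of 8 is an assumed value, and treating samples as independent with a fixed success rate is a simplification.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def prob_group_has_signal(p_correct, group_size):
    """Probability that a group of binary rewards is not all-equal.

    Assumes independent samples with the same success probability.
    Groups with identical rewards give zero advantage for every sample.
    """
    all_wrong = (1.0 - p_correct) ** group_size
    all_right = p_correct ** group_size
    return 1.0 - all_wrong - all_right


# A group where every completion is wrong carries no learning signal.
print(group_relative_advantages([0, 0, 0, 0]))
# A single correct completion gets a positive advantage, the rest negative.
print(group_relative_advantages([1, 0, 0, 0]))
# ASSUMPTION: group size 8; 1.82% is the pre-RLVR accuracy from the abstract.
print(prob_group_has_signal(0.0182, 8))
```

With a 1.82% success rate and the assumed group size, only a minority of prompts would yield any non-zero advantage under this simplified model. This is offered as a possible reading of the observed drop, not as a diagnosis made by the paper.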
Load-bearing premise
The six-task zero-shot harness after benchmark-overlap removal gives a representative and unbiased measure of relative data efficiency across models trained on different data volumes and sources.
What would settle it
Retraining SmolLM-135M from scratch on the identical 13B-token mixture and pipeline and obtaining a mean score below 0.4150 would support the reported efficiency ratio; a substantially higher score would undermine it.
Original abstract
Small language models are cheap to serve and feasible on local hardware, but strong public 135M-class systems are commonly trained with hundreds of billions to trillions of tokens on large clusters. We study a sharply resource-constrained regime: a complete 134.5M-parameter language-model pipeline executed on one NVIDIA L20 GPU. The released checkpoint, L20-Edu-135M, receives approximately 13B pretraining tokens: 10B FineWeb-Edu tokens followed by a 3B-token educational, mathematics, code, and reasoning mixture. We document the architecture, data gates, cross-source MinHash/LSH near-deduplication, segment deduplication, benchmark-overlap removal, throughput optimization, supervised fine-tuning (SFT) with weight interpolation, and reinforcement learning from verifiable rewards (RLVR) on GSM8K. In a self-run zero-shot six-task harness, L20-Edu-135M obtains a mean score of 0.4150. It trails SmolLM-135M (0.4767) and SmolLM2-135M (0.4917), but its mean is 87.1% of SmolLM-135M's while its nominal token count is 2.17% as large. This ratio is descriptive, not evidence of statistical equivalence or a controlled scaling law. The model exceeds several older 100M-160M public baselines under the same harness. Direct GRPO-style RLVR decreases GSM8K exact-match accuracy from 1.82% to 1.59% (192-token completions) and 1.21% (320-token completions). These single-run results identify a concrete failure mode rather than establishing a general lower bound on RLVR. The contribution is an auditable resource-constrained case study, not a state-of-the-art claim.
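The abstract mentions SFT with weight interpolation but does not say which checkpoints are mixed or with what coefficient. A common form is linear interpolation between the pre-SFT and post-SFT weights, sketched below on plain Python lists. The parameter name `alpha` and the dictionary layout are assumptions made for illustration.

```python
def interpolate_weights(base, tuned, alpha):
    """Return (1 - alpha) * base + alpha * tuned for every parameter.

    base and tuned map parameter names to flat lists of floats.
    alpha = 0 keeps the base model; alpha = 1 keeps the fine-tuned model.
    """
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(tuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base[name], tuned[name])
        ]
    return merged


# Hypothetical two-parameter "checkpoints" for demonstration only.
pretrained = {"w": [0.0, 2.0], "b": [1.0]}
fine_tuned = {"w": [1.0, 4.0], "b": [3.0]}
print(interpolate_weights(pretrained, fine_tuned, alpha=0.5))
```

Interpolating toward the pretrained weights is typically used to limit how far fine-tuning moves the model from its base behavior; whether that is the motivation here is not stated on this page.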
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes an end-to-end single-GPU training pipeline for the 134.5M-parameter model L20-Edu-135M. It uses ~13B pretraining tokens (10B FineWeb-Edu followed by a 3B educational/math/code/reasoning mixture), applies MinHash/LSH deduplication, benchmark-overlap removal, SFT with weight interpolation, and RLVR on GSM8K. In a self-run zero-shot six-task harness the model scores 0.4150 (87.1% of SmolLM-135M while using 2.17% of its nominal tokens). All results are explicitly labeled single-run and descriptive; the checkpoint is released.
Significance. If the reported numbers hold, the work supplies a concrete, fully documented, auditable case study of extreme-resource small-LM training. The explicit release of the checkpoint, the detailed accounting of data gates and throughput optimizations, and the transparent reporting of the RLVR performance drop constitute reusable artifacts for the community. The deliberate framing as a descriptive case study rather than a statistical or scaling-law claim is a strength.
minor comments (2)
- [Abstract] The relative-performance ratio is computed only versus SmolLM-135M; adding the corresponding ratio versus SmolLM2-135M would give readers immediate context without lengthening the text.
- The six-task harness description would benefit from an explicit list of the tasks and the exact overlap-removal procedure in a dedicated subsection or table.
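Since the exact overlap-removal procedure is not described on this page, the following is only a generic sketch of one common approach: drop any training document that shares a word n-gram with a benchmark item. The n-gram length of 8, the whitespace tokenization, and all function names are assumptions, and the example strings are invented.

```python
def ngrams(text, n):
    """Set of word n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def remove_overlapping(docs, benchmark_texts, n=8):
    """Split docs into (kept, dropped) by n-gram overlap with the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    kept, dropped = [], []
    for doc in docs:
        if ngrams(doc, n) & index:
            dropped.append(doc)
        else:
            kept.append(doc)
    return kept, dropped


# Invented examples; not taken from any real benchmark.
benchmark = ["a farmer has twelve sheep and buys seven more how many sheep now"]
corpus = [
    "forum post: a farmer has twelve sheep and buys seven more how many sheep now answer nineteen",
    "an unrelated paragraph about photosynthesis in green plants and how light drives it",
]
kept, dropped = remove_overlapping(corpus, benchmark)
print(len(kept), len(dropped))
```

The choice of n controls the trade-off: shorter n-grams remove more benign text, longer ones miss paraphrased leaks. A table listing the six tasks alongside the threshold actually used would let readers judge that trade-off for this harness.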
Simulated Author's Rebuttal
We thank the referee for the detailed and positive summary of the manuscript, the recognition of its value as an auditable single-GPU case study, and the recommendation to accept. No major comments were raised.
Circularity Check
No significant circularity; descriptive case study only
full rationale
The manuscript advances no derivation chain, scaling law, or fitted prediction. All reported numbers (13B tokens, 0.4150 mean score, 87.1% relative performance) are stated as direct outcomes of a single training run and external zero-shot evaluation. The paper explicitly labels the ratio as descriptive only and disclaims any controlled scaling claim. No equations, self-citations, or ansatzes are load-bearing for the core observations.
Axiom & Free-Parameter Ledger
free parameters (2)
- total pretraining tokens = 13B
- data mixture split = 10B + 3B
axioms (2)
- domain assumption: Benchmark-overlap removal preserves a fair and representative evaluation harness
- domain assumption: MinHash/LSH and segment deduplication remove only redundant content without harming downstream capability