arxiv: 2606.22189 · v1 · pith:JBSZIZ2Anew · submitted 2026-06-20 · 💻 cs.LG · cs.AI

L20-Edu-135M: An Auditable Single-GPU Study of Data-Efficient Small Language Modeling

Pith reviewed 2026-06-26 12:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords small language models · data efficiency · single-GPU training · pretraining pipeline · zero-shot evaluation · deduplication · educational data

The pith

A 135M model trained on one GPU with 13 billion tokens reaches 87.1 percent of SmolLM-135M's zero-shot score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates a sharply constrained training regime in which an entire 134.5M-parameter language-model pipeline runs on a single NVIDIA L20 GPU. It releases L20-Edu-135M after pretraining on roughly 13 billion tokens (10 billion from FineWeb-Edu, followed by a 3-billion-token educational, math, code, and reasoning mixture), with full documentation of deduplication steps, benchmark-overlap removal, and post-training stages. In a self-run zero-shot harness on six tasks the model scores 0.4150 on average, which is 87.1 percent of SmolLM-135M's mean while using only 2.17 percent of that model's nominal token count. The work supplies an auditable case study rather than a scaling-law proof or state-of-the-art claim.

Core claim

L20-Edu-135M receives approximately 13B pretraining tokens and obtains a mean score of 0.4150 in a self-run zero-shot six-task harness, representing 87.1% of SmolLM-135M's performance at 2.17% of the nominal token count.
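
The headline ratios are plain divisions, and the short sketch below recomputes them from the scores quoted in the abstract. The 600B-token count for SmolLM-135M is an assumption: it is not stated on this page and is used here only because it reproduces the quoted 2.17%. The function and constant names are illustrative.

```python
def percent_of(value, reference):
    """Return `value` expressed as a percentage of `reference`."""
    return 100.0 * value / reference


# Mean six-task scores quoted in the abstract.
L20_MEAN = 0.4150
SMOLLM_MEAN = 0.4767
SMOLLM2_MEAN = 0.4917

# Token counts: 13B is from the abstract; 600B is an ASSUMED nominal
# budget for SmolLM-135M, not a figure given on this page.
L20_TOKENS = 13e9
SMOLLM_TOKENS_ASSUMED = 600e9

score_vs_smollm = percent_of(L20_MEAN, SMOLLM_MEAN)
score_vs_smollm2 = percent_of(L20_MEAN, SMOLLM2_MEAN)
token_share = percent_of(L20_TOKENS, SMOLLM_TOKENS_ASSUMED)

print(f"{score_vs_smollm:.1f}% of SmolLM-135M's mean score")    # 87.1%
print(f"{score_vs_smollm2:.1f}% of SmolLM2-135M's mean score")  # 84.4%
print(f"{token_share:.2f}% of the assumed SmolLM-135M tokens")  # 2.17%
```

As the abstract stresses, these ratios are descriptive; dividing two single-run means says nothing about statistical equivalence.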

What carries the argument

The complete single-GPU pipeline that combines MinHash/LSH near-deduplication, segment deduplication, benchmark-overlap removal, throughput optimization, SFT with weight interpolation, and RLVR on GSM8K.
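
To make the near-deduplication step concrete, here is a minimal MinHash/LSH sketch in pure Python. It is not the paper's implementation: the shingle size, number of hash functions, band count, and all function names are assumptions chosen for illustration, since this page does not give the actual settings.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles (k=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=128):
    """Signature = minimum hash value under each of num_perm hash functions."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Documents whose signatures agree on any whole band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    base = " ".join(f"w{i}" for i in range(50))
    docs = {
        "a": base,
        "b": base.replace("w49", "changed"),            # near-duplicate of "a"
        "c": " ".join(f"x{i}" for i in range(50)),      # unrelated text
    }
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    print(lsh_candidate_pairs(sigs))
```

Banding trades recall for cost: more bands with fewer rows each catch lower-similarity pairs but produce more candidates to verify. Candidate pairs would normally be confirmed with `estimated_jaccard` (or exact Jaccard) against a threshold before anything is dropped.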

If this is right

  • L20-Edu-135M exceeds several older 100M-160M public baselines under the same harness.
  • Direct GRPO-style RLVR lowers GSM8K exact-match accuracy from 1.82% to 1.59% at 192 tokens and to 1.21% at 320 tokens.
  • The full pipeline documentation enables exact reproduction and auditing of every data and training decision.
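
For readers unfamiliar with the RLVR setup named in the second bullet, the sketch below shows the two ingredients in their simplest form: a verifiable exact-match reward and a GRPO-style group-relative advantage. It is an illustration under assumptions, not the paper's code. The answer-extraction rule (take the last number in the text), the `eps` value, and all function names are hypothetical.

```python
import re
from statistics import mean, pstdev


def final_number(text):
    """Return the last number in `text` as a float, or None if there is none.

    Assumption: the final numeric token is treated as the answer, which also
    works for GSM8K-style references ending in '#### <number>'.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def exact_match_reward(completion, reference):
    """1.0 if the extracted answers are equal, else 0.0."""
    pred = final_number(completion)
    gold = final_number(reference)
    return 1.0 if pred is not None and pred == gold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "She sells 6 * 7 = 42 eggs. #### 42"
    completions = ["6 * 7 = 42, so the answer is 42.", "The answer is 36."]
    rewards = [exact_match_reward(c, reference) for c in completions]
    print(rewards, group_relative_advantages(rewards))
```

One property of this formulation is that a group in which every completion is wrong has identical rewards, so every advantage is zero and that prompt contributes no gradient signal. As a purely illustrative calculation, with an independent per-sample success rate of 0.0182 and a hypothetical group size of 8, about 86% of groups would contain no correct completion. The source does not attribute the reported accuracy drop to this effect.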

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data curation and overlap removal may allow small models to retain most capability with far fewer tokens than current scaling recipes assume.
  • The observed drop from RLVR in this low-token setting points to possible sensitivity of reinforcement stages to data volume or hyperparameter choices.
  • Extending the same audited pipeline to 50-100B tokens would test whether the efficiency advantage persists or narrows.

Load-bearing premise

The six-task zero-shot harness after benchmark-overlap removal gives a representative and unbiased measure of relative data efficiency across models trained on different data volumes and sources.
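
What "benchmark-overlap removal" means in practice depends on details this page does not give. A common approach, sketched below as an assumption rather than the paper's procedure, drops any training document that shares a long word n-gram with a benchmark item. The n-gram length of 8, whitespace tokenization, and the function names are all illustrative.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=8):
    """Union of all n-grams that appear in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def remove_overlapping(docs, index, n=8):
    """Split docs into (kept, dropped) by n-gram overlap with the index."""
    kept, dropped = [], []
    for doc in docs:
        (dropped if ngrams(doc, n) & index else kept).append(doc)
    return kept, dropped


if __name__ == "__main__":
    benchmark = ["the quick brown fox jumps over the lazy dog near the river"]
    corpus = [
        "Intro text. The quick brown fox jumps over the lazy dog near the river bank.",
        "An unrelated lesson about photosynthesis in green plants and algae cells.",
    ]
    kept, dropped = remove_overlapping(corpus, build_benchmark_index(benchmark))
    print(len(kept), len(dropped))
```

The choice of n matters for the premise above: a short n removes many innocent documents, while a long n misses paraphrased or reformatted benchmark items, and either error changes what the harness measures.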

What would settle it

Retraining SmolLM-135M from scratch on the identical 13B-token mixture and pipeline and obtaining a mean score below 0.4150 would support the reported efficiency ratio; a substantially higher score would undermine it.

Original abstract

Small language models are cheap to serve and feasible on local hardware, but strong public 135M-class systems are commonly trained with hundreds of billions to trillions of tokens on large clusters. We study a sharply resource-constrained regime: a complete 134.5M-parameter language-model pipeline executed on one NVIDIA L20 GPU. The released checkpoint, L20-Edu-135M, receives approximately 13B pretraining tokens: 10B FineWeb-Edu tokens followed by a 3B-token educational, mathematics, code, and reasoning mixture. We document the architecture, data gates, cross-source MinHash/LSH near-deduplication, segment deduplication, benchmark-overlap removal, throughput optimization, supervised fine-tuning (SFT) with weight interpolation, and reinforcement learning from verifiable rewards (RLVR) on GSM8K. In a self-run zero-shot six-task harness, L20-Edu-135M obtains a mean score of 0.4150. It trails SmolLM-135M (0.4767) and SmolLM2-135M (0.4917), but its mean is 87.1% of SmolLM-135M's while its nominal token count is 2.17% as large. This ratio is descriptive, not evidence of statistical equivalence or a controlled scaling law. The model exceeds several older 100M-160M public baselines under the same harness. Direct GRPO-style RLVR decreases GSM8K exact-match accuracy from 1.82% to 1.59% (192-token completions) and 1.21% (320-token completions). These single-run results identify a concrete failure mode rather than establishing a general lower bound on RLVR. The contribution is an auditable resource-constrained case study, not a state-of-the-art claim.
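
The abstract mentions SFT with weight interpolation without spelling out the recipe on this page. One common reading, assumed here, is a linear blend of the pretrained and fine-tuned parameters, theta = (1 - alpha) * theta_base + alpha * theta_sft. The sketch below uses plain Python dictionaries in place of real checkpoints; the blend coefficient, the choice of checkpoints, and the function name are hypothetical.

```python
def interpolate_weights(base, sft, alpha):
    """Linearly blend two parameter dicts: (1 - alpha) * base + alpha * sft.

    `base` and `sft` map parameter names to equal-length lists of floats.
    alpha = 0 returns the base weights, alpha = 1 the fine-tuned weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != sft.keys():
        raise ValueError("checkpoints must share the same parameter names")
    blended = {}
    for name in base:
        if len(base[name]) != len(sft[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [
            (1.0 - alpha) * b + alpha * s for b, s in zip(base[name], sft[name])
        ]
    return blended


if __name__ == "__main__":
    base = {"layer.weight": [0.0, 2.0]}
    sft = {"layer.weight": [1.0, 4.0]}
    print(interpolate_weights(base, sft, alpha=0.25))  # {'layer.weight': [0.25, 2.5]}
```

A blend of this kind is typically used to keep some of the pretrained model's general behavior while adopting part of the fine-tuned behavior; whether that is the authors' motivation is not stated here.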

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript describes an end-to-end single-GPU training pipeline for the 134.5M-parameter model L20-Edu-135M. It uses ~13B pretraining tokens (10B FineWeb-Edu followed by a 3B educational/math/code/reasoning mixture), applies MinHash/LSH deduplication, benchmark-overlap removal, SFT with weight interpolation, and RLVR on GSM8K. In a self-run zero-shot six-task harness the model scores 0.4150 (87.1% of SmolLM-135M while using 2.17% of its nominal tokens). All results are explicitly labeled single-run and descriptive; the checkpoint is released.

Significance. If the reported numbers hold, the work supplies a concrete, fully documented, auditable case study of extreme-resource small-LM training. The explicit release of the checkpoint, the detailed accounting of data gates and throughput optimizations, and the transparent reporting of the RLVR performance drop constitute reusable artifacts for the community. The deliberate framing as a descriptive case study rather than a statistical or scaling-law claim is a strength.

minor comments (2)
  1. [Abstract] The relative-performance ratio is computed only against SmolLM-135M; adding the corresponding ratio against SmolLM2-135M would give readers immediate context without lengthening the text.
  2. The six-task harness description would benefit from an explicit list of the tasks and the exact overlap-removal procedure in a dedicated subsection or table.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and positive summary of the manuscript, the recognition of its value as an auditable single-GPU case study, and the recommendation to accept. No major comments were raised.

Circularity Check

0 steps flagged

No significant circularity; descriptive case study only

Full rationale

The manuscript advances no derivation chain, scaling law, or fitted prediction. All reported numbers (13B tokens, 0.4150 mean score, 87.1% relative performance) are stated as direct outcomes of a single training run and external zero-shot evaluation. The paper explicitly labels the ratio as descriptive only and disclaims any controlled scaling claim. No equations, self-citations, or ansatzes are load-bearing for the core observations.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The study relies on standard assumptions about the validity of deduplicated educational data and benchmark-based evaluation for measuring data efficiency. The token budget and mixture ratios are chosen parameters rather than derived quantities.

free parameters (2)
  • total pretraining tokens = 13B
    The 13B token budget is a deliberate choice defining the resource-constrained regime under study.
  • data mixture split = 10B + 3B
    The 10B FineWeb-Edu plus 3B educational/math/code/reasoning split is selected by the authors to match the intended domain.
axioms (2)
  • domain assumption Benchmark overlap removal preserves a fair and representative evaluation harness
    Invoked to justify the six-task zero-shot comparison after data cleaning.
  • domain assumption MinHash/LSH and segment deduplication remove only redundant content without harming downstream capability
    Central to the data gates described in the pipeline.

pith-pipeline@v0.9.1-grok · 5872 in / 1528 out tokens · 59322 ms · 2026-06-26T12:03:20.475151+00:00 · methodology
