pith. sign in

arxiv: 2606.14150 · v3 · pith:QVMLCX7Cnew · submitted 2026-06-12 · 💻 cs.LG · cs.CL

Small LLMs: Pruning vs. Training from Scratch

Pith reviewed 2026-06-30 10:58 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM pruningsmall language modelsmodel compressiontraining from scratchpruning methodstoken budgetLlama models
0
0 comments X

The pith

Pruning a large LLM produces stronger small models than training from scratch when training tokens are limited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares creating small language models by pruning a large pretrained one like Llama-3.1-8B versus training equivalent-sized models from random weights. Across six pruning methods at different ratios and two token budget setups, pruned models start from a better point and end up stronger when the training token count is held equal. The edge shrinks as more tokens are used or as more parameters are removed. When scratch training receives the entire token budget spent on the parent model plus its own training, only the finer-grained pruning methods keep an advantage. This leads to the practical advice that pruning is preferable under tight token limits while scratch training can compete when budgets are ample and pruning is coarse.

Core claim

With the same training token budget, pruned initialization consistently outperforms random initialization, showing that the parent model provides a strong starting point. The advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio studied. When training from scratch is given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed.

What carries the argument

Token-matched controlled comparisons between pruned initializations from Llama-3.1-8B and random initializations, using six pruning methods at ratios 0.5-0.8.

If this is right

  • Pruning is better than training from scratch when the training token budget is limited.
  • The performance advantage of pruning decreases with larger training token budgets.
  • At high pruning ratios, the benefit of starting from a pruned model nearly disappears.
  • Fine-granularity pruning transfers knowledge that extra tokens cannot fully recover.
  • Coarser pruning can be matched by scratch training given enough tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers facing token constraints should start from a pruned large model rather than random weights.
  • Knowledge in the parent model is not fully recoverable by scale of training alone at fine pruning levels.
  • Future work could test if these patterns hold for models larger than 8B or different architectures.
  • Pruning may be most useful as a way to bootstrap small models efficiently rather than as a universal replacement for training.

Load-bearing premise

The tested pruning methods and token budgets are representative of general cases for recommending pruning over scratch training.

What would settle it

An experiment where a model trained from scratch with the full parent token budget plus its own training tokens surpasses the fine-grained pruned model at the same final size.

Figures

Figures reproduced from arXiv: 2606.14150 by Jiachen Zhu, Kunjun Li, Mingjie Sun, Taiming Lu, Yufeng Xu, Zhuang Liu.

Figure 1
Figure 1. Figure 1: Initialization by pruning provides a strong advantage over random initialization, but this advantage diminishes as training continues. Left: under the same training token budget, pruning initialization beats random initialization, although the advantage decreases with longer training. Right: when the random initialization baseline is trained with the full token budget used by the entire pruning pipeline, i… view at source ↗
Figure 2
Figure 2. Figure 2: Pruning granularity and method overview. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Four structured pruning methods across retraining token budgets (Llama-3.1-8B [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: S50 vs. P200-R50 across model sizes for depth and width pruning. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: P200-R50 versus S250 across pruning methods. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
read the original abstract

Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5--0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript compares pruning Llama-3.1-8B at ratios 0.5-0.8 using six methods spanning depth/width/sparse granularities against training from scratch. Under two controlled token-matched settings, it claims (1) pruned initialization consistently outperforms random initialization with equal tokens, though the gap narrows with larger budgets and higher ratios (nearly vanishing at 0.8), and (2) when scratch training receives the full pipeline token budget, fine-grained pruning retains an edge while coarser structured pruning can be matched or surpassed. The authors conclude that pruning is preferable under limited budgets while scratch training can compete otherwise.

Significance. If the empirical patterns hold, the work supplies actionable guidance for small-LLM development by quantifying when a pretrained parent confers an advantage that extra tokens alone cannot recover. The direct, token-controlled comparisons across multiple pruning granularities constitute a concrete strength.

major comments (2)
  1. [Experiments] Experiments section (and abstract): the claim that the pruned-vs-scratch advantage narrows as the training token budget grows (and nearly vanishes at the highest pruning ratio) is supported by only two token-matched settings. This limited sampling does not establish the claimed scaling behavior or underwrite the conditional recommendation that scratch training becomes competitive when budgets are not limited.
  2. [Abstract and Experiments] Abstract and §4: the central outperformance claims are presented without error bars, variance estimates, or exact token counts per run, so it is impossible to judge whether the reported gaps are statistically reliable or sensitive to the specific random seeds and data ordering used.
minor comments (1)
  1. [Methods] The six pruning methods are introduced without a compact summary table of their computational costs or hyper-parameters, making it hard to replicate the exact controls.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our experimental design and result presentation. We respond to each major comment below.

read point-by-point responses
  1. Referee: Experiments section (and abstract): the claim that the pruned-vs-scratch advantage narrows as the training token budget grows (and nearly vanishes at the highest pruning ratio) is supported by only two token-matched settings. This limited sampling does not establish the claimed scaling behavior or underwrite the conditional recommendation that scratch training becomes competitive when budgets are not limited.

    Authors: We appreciate this observation. Our study examines two practically relevant token budgets: one matching the tokens consumed by the full pruning-plus-continued-training pipeline, and one allocating that entire budget to training from scratch. The narrowing of the advantage is observed consistently across pruning ratios 0.5--0.8 in these settings. We will revise the abstract and Experiments section to frame the narrowing as an empirical pattern between the two studied budgets rather than a general scaling law, thereby grounding the conditional recommendation in the available evidence without overgeneralization. revision: partial

  2. Referee: Abstract and §4: the central outperformance claims are presented without error bars, variance estimates, or exact token counts per run, so it is impossible to judge whether the reported gaps are statistically reliable or sensitive to the specific random seeds and data ordering used.

    Authors: We agree that error bars, variance estimates, and exact token counts would strengthen interpretability. Each configuration was run once owing to the substantial compute required for multiple small-LLM trainings. In revision we will add the precise token counts for every setting to the abstract and Experiments section. We will also insert a limitations paragraph noting the single-run design and possible sensitivity to seeds or data order, identifying multi-seed verification as future work. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical comparisons without reduction to inputs or self-citations

full rationale

The paper presents an empirical study comparing pruned initializations against random initializations under matched token budgets, using six pruning methods at various ratios. All central claims (e.g., pruned models outperforming scratch training, with advantage narrowing at higher budgets or ratios) are stated as direct observations from the reported experiments rather than derived via equations, fitted parameters renamed as predictions, or load-bearing self-citations. No mathematical derivation chain exists that could reduce to its own inputs by construction; results are obtained from independent training runs and are externally falsifiable by replication. The limited number of token-budget settings affects evidential strength but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical derivations, new entities, or axioms; free parameters are the experimental choices of pruning ratios and methods.

free parameters (1)
  • pruning ratios
    Chosen values 0.5-0.8 to span the range studied; not derived from data but set by experiment design.

pith-pipeline@v0.9.1-grok · 5757 in / 1034 out tokens · 28507 ms · 2026-06-30T10:58:41.905102+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    10 Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457,

  2. [2]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437,

  3. [3]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701,

  4. [4]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

  5. [5]

    Pre-training under infinite compute

    Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute. arXiv preprint arXiv:2509.14786,

  6. [6]

    Accelerating sparse deep neural networks.arXiv preprint arXiv:2104.08378, 2021

    Asit K. Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks.arXiv preprint arXiv:2104.08378,

  7. [7]

    Reuse, don’t retrain: A recipe for continued pretraining of language models.arXiv preprint arXiv:2407.07263,

    Jupinder Parmar, Sanjeev Satheesh, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Reuse, don’t retrain: A recipe for continued pretraining of language models.arXiv preprint arXiv:2407.07263,

  8. [8]

    Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,

    Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Llm pruning and distillation in practice: The minitron approach.arXiv preprint arXiv:2408.11796,

  9. [9]

    LLaMA: Open and Efficient Foundation Language Models

    Llama Team. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023a. Llama Team. Llama 2: Open foundation and fine-tuned chat models, 2023b. Llama Team. The llama 3 herd of models,

  10. [10]

    • § B gives the full related-work discussion condensed in the main paper

    14 Appendix This appendix provides additional methodology details, configurations, and per-benchmark results support- ing the main paper: • § A extends the conclusion with a discussion of what transfers from the larger model and enumerates the axes of variation our study leaves unexplored. • § B gives the full related-work discussion condensed in the main...

  11. [11]

    or Dolma (AI2, 2024), and the Meta-RN comparisons control for architecture but not for the original Meta pretraining mixture. (3)Knowledge-distillation baselines.We compare pruning to plain language-modeling retraining; we do not include the post-pruning knowledge-distillation pipelines used by recent structured methods (e.g. Minitron, ShearedLLaMA), whic...

  12. [12]

    Recent works on pretraining under limited data and unlimited compute have explored the effect of traditional scaling method in this scenario

    Model scaling and constraints.The traditional language model scaling paradigm assumes that simple scaling of model size and training data results in stronger performance (Kaplan et al., 2020), however, as recent language models scale to hundreds of billions of parameters (DeepSeek-AI, 2024; Qwen Team, 2025; Google, 2025a) and training data scales to trill...

  13. [13]

    has emerged as an approach to adapt general-purpose language models for domain specific task. The primary challenge in CPT iscatastrophic forgetting(McCloskey and Cohen, 1989; Luo et al., 2025), where the model loses its prior knowledge and capabilities during the continual learning stage, and the learning rate schedule must be carefully designed to mitig...

  14. [14]

    Evci et al

    conjectures that every randomly initialized network contains a sparse subnetwork that can be trained to match the full network’s performance; crucially, this subnetwork must be trained from itsoriginalinitialization, since random reinitialization substantially degrades performance. Evci et al. (2020) further show that static sparse training from scratch c...

  15. [15]

    The full candidate 20 Ratio Hidden Attn heads MLP Params Sel

    and vary attention heads and MLP size within each. The full candidate 20 Ratio Hidden Attn heads MLP Params Sel. 62.5% 2176 32 7168 3.1B 2304 32 6656 3.1B 2432 32 6144 3.1B ✓ 2560 32 6400 3.1B 2688 32 5888 3.1B 75% 1664 32 6656 2.0B 1792 32 6016 2.0B ✓ 1920 32 5248 2.0B 2048 32 4480 2.0B 81.3% 1408 32 5632 1.5B 1536 32 4736 1.5B ✓ 1664 32 3840 1.5B Table ...

  16. [16]

    For each corpus, we collect 256 sequences of length 8192 (the max position embedding of Llama-3.1) as the evaluation set

    E Evaluation protocol Linguistic perplexity.We evaluate on the general-domain corpora C4 (Raffel et al., 2020), WikiText-103 (Mer- ity et al., 2017), and WikiText-2 (Merity et al., 2017), along with the news-and-summaries corpus CNN Dailymail (Chen et al., 2016). For each corpus, we collect 256 sequences of length 8192 (the max position embedding of Llama...

  17. [17]

    Evci et al

    posits that a pruned subnetwork trained from itsoriginalinitialization converges faster and to higher accuracy than the same sparse structure trained from arandomreinitialization. Evci et al. (2020) further demonstrate that static sparse training can get stuck in isolated local minima, and that allowing the sparse topology to evolve during training helps ...

  18. [18]

    For Wanda and SparseGPT, 50% denotes unstructured sparsity

    Efficiency comparison between pruning methods.Models are obtained by pretraining Llama-3.1-8B for 200B tokens, pruning at the listed ratio, and retraining for 50B tokens. For Wanda and SparseGPT, 50% denotes unstructured sparsity. FLOPs are computed for a single forward pass with sequence length 2048; for sparse models, theoretical FLOPs assume 50% of wei...