Small LLMs: Pruning vs. Training from Scratch
Pith reviewed 2026-06-27 05:03 UTC · model grok-4.3
The pith
Pruning a large LLM into a small one outperforms training the small model from scratch when the training token budget is limited.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under matched training token budgets, pruned initialization from the parent model consistently outperforms random initialization. When training from scratch is instead allotted the full token budget of the pruning pipeline, pruning at finer granularities retains an advantage while coarser structured pruning can be matched or surpassed. This indicates that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only when pruning operates at fine granularity.
What carries the argument
Token-matched experimental settings that isolate the effect of pruned initialization from a large parent versus random initialization across six pruning methods at ratios 0.5-0.8.
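To make the three granularities concrete, here is a minimal pure-Python sketch of depth, width, and sparse pruning on toy weights. The function names, the magnitude and norm criteria, and the "drop the last layers" rule are illustrative assumptions; they are not the six methods evaluated in the paper. The pruning ratio is taken to mean the fraction removed.

```python
from typing import List

Matrix = List[List[float]]


def prune_depth(layers: List[Matrix], ratio: float) -> List[Matrix]:
    """Coarsest granularity: drop whole layers (assumed here: the last ones)."""
    keep = len(layers) - int(len(layers) * ratio)
    return layers[:keep]


def prune_width(weight: Matrix, ratio: float) -> Matrix:
    """Structured granularity: drop whole rows, keeping the largest L2 norms."""
    keep = len(weight) - int(len(weight) * ratio)
    ranked = sorted(
        range(len(weight)),
        key=lambda i: sum(w * w for w in weight[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weight[i] for i in kept]


def prune_sparse(weight: Matrix, ratio: float) -> Matrix:
    """Finest granularity: zero the individual weights of smallest magnitude."""
    positions = [(r, c) for r, row in enumerate(weight) for c in range(len(row))]
    positions.sort(key=lambda rc: abs(weight[rc[0]][rc[1]]))
    drop = set(positions[: int(len(positions) * ratio)])
    return [
        [0.0 if (r, c) in drop else w for c, w in enumerate(row)]
        for r, row in enumerate(weight)
    ]


if __name__ == "__main__":
    toy = [[1.0, -4.0], [0.5, 3.0], [2.0, 0.1], [-0.2, 0.3]]
    print(prune_width(toy, 0.5))
    print(prune_sparse(toy, 0.5))
```

Width and depth pruning shrink the tensor shapes, while sparse pruning keeps the shape and only zeroes entries, which is one way to read the claim that finer granularity preserves more of the parent.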
If this is right
- When training tokens are scarce, pruning a large parent is the stronger route to a small model.
- When training tokens are abundant, training from scratch becomes competitive for coarse structured pruning.
- The performance gap between pruned and random initialization narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest ratio studied.
- With fine-grained pruning, the parent model transfers knowledge that additional training tokens alone cannot fully recover.
- A large pretrained parent is not always required for coarser pruning if the practitioner can afford a large training budget.
Where Pith is reading between the lines
- For extremely large future token budgets, the marginal value of maintaining and pruning very large parent models may decline.
- Hardware or data-center operators who can allocate long training runs may be able to skip the cost of keeping oversized parent models, at least where coarse structured pruning is the alternative.
- The granularity dependence suggests that hybrid pipelines mixing coarse structured pruning with later fine unstructured pruning deserve direct comparison.
Load-bearing premise
The token budgets are accurately matched between the pruning-plus-retraining pipeline and the pure from-scratch condition, and the six tested pruning methods are representative of the broader space of possible techniques.
What would settle it
A controlled replication in which training from scratch with the full pipeline token budget matches or exceeds every pruned model at every granularity and every ratio would falsify the retained advantage of fine-grained pruning.
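The falsification condition above can be written as a simple check over a results table. This is a sketch under stated assumptions: the function names are hypothetical, scores are treated as higher-is-better, and the numbers in the usage example are made up for illustration rather than taken from the paper.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, float]  # (granularity, pruning ratio)


def scratch_dominates(pruned: Dict[Key, float], scratch: Dict[float, float]) -> bool:
    """True if from-scratch training matches or exceeds every pruned model.

    `pruned` maps (granularity, ratio) to a score; `scratch` maps ratio to the
    score of the from-scratch model trained with the full pipeline budget.
    """
    return all(scratch[ratio] >= score for (_, ratio), score in pruned.items())


def surviving_advantages(pruned: Dict[Key, float], scratch: Dict[float, float]) -> List[Key]:
    """List the (granularity, ratio) cells where pruning still wins."""
    return sorted(key for key, score in pruned.items() if score > scratch[key[1]])


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    pruned = {("depth", 0.5): 0.58, ("width", 0.5): 0.60, ("sparse", 0.5): 0.66}
    scratch = {0.5: 0.62}
    print(scratch_dominates(pruned, scratch))
    print(surviving_advantages(pruned, scratch))
```

If `scratch_dominates` returned true on a faithful replication, the retained advantage of fine-grained pruning would be refuted; any non-empty result from `surviving_advantages` shows where it persists.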
Original abstract
Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5-0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares pruning Llama-3.1-8B at ratios 0.5-0.8 using six methods (spanning depth, width, and sparse granularities) against training from scratch, under two controlled token-matched settings. Setting (1) uses the same post-pruning training tokens for both conditions and finds that pruned initialization outperforms random initialization, with the gap narrowing at higher budgets and higher pruning ratios. Setting (2) gives from-scratch training the full pipeline token count and finds that finer-grained pruning retains an advantage while coarser structured pruning can be matched or surpassed. The recommendation is that pruning is preferable under limited budgets, but that a large parent is not always necessary for coarser pruning when the budget is not limited.
Significance. If the token budgets are verifiably matched and the six methods are representative, the work supplies practical guidance on when knowledge inherited from a pretrained parent through pruning cannot be recovered by extra tokens alone. The two-setting design with held-out performance measurement is a positive feature for an empirical study in this area.
major comments (1)
- [Abstract] The central recommendation rests on the two settings being 'token-matched,' yet the abstract provides no explicit accounting of how pretraining tokens are included in the full pipeline budget, how epochs/sequence lengths are normalized across conditions, or whether batch sizes and optimizer steps are identical. This equivalence is load-bearing for the claim that pruning's advantage at limited budgets is due to initialization rather than budget mismatch.
minor comments (2)
- [Abstract] The six pruning methods are described only at the level of 'spanning depth, width, and sparse granularities' without names or citations; the main text should list them explicitly with references for reproducibility.
- [Abstract] Directional trends are reported without mention of error bars, number of runs, or statistical tests; these should be added to the results sections to allow assessment of the reported narrowing of advantages.
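The last minor comment can be made concrete with a small sketch of how a gap between pruned and random initialization might be reported with an error estimate. The function names are hypothetical, the runs are assumed to be independent repetitions (for example, different seeds), and the scores in the usage example are invented for illustration.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_sem(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def gap_with_error(pruned: Sequence[float], scratch: Sequence[float]) -> Tuple[float, float]:
    """Difference of means (pruned - scratch) and its combined standard error.

    Assumes the two sets of runs are independent, so the errors add in quadrature.
    """
    mean_p, sem_p = mean_sem(pruned)
    mean_s, sem_s = mean_sem(scratch)
    return mean_p - mean_s, math.hypot(sem_p, sem_s)


if __name__ == "__main__":
    # Invented scores for three runs per condition.
    gap, err = gap_with_error([0.61, 0.63, 0.62], [0.58, 0.60, 0.59])
    print(f"gap = {gap:.3f} +/- {err:.3f}")
```

A gap that is small relative to its standard error at the highest pruning ratio would support the "nearly vanishing" reading; a formal test would still be needed for a stronger statement.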
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract's description of the token-matching procedure. The concern is well-taken, as precise budget accounting is central to the claims. We address the point below and will revise the abstract accordingly.
- Referee: [Abstract] The central recommendation rests on the two settings being 'token-matched,' yet the abstract provides no explicit accounting of how pretraining tokens are included in the full pipeline budget, how epochs/sequence lengths are normalized across conditions, or whether batch sizes and optimizer steps are identical. This equivalence is load-bearing for the claim that pruning's advantage at limited budgets is due to initialization rather than budget mismatch.
  Authors: We agree the abstract would benefit from greater precision on these points. In the revised version we will add a concise clause clarifying that (i) Setting (2) assigns the scratch model the sum of the parent model's original pretraining tokens and the post-pruning training tokens, (ii) all runs use identical batch size, sequence length, and optimizer hyperparameters, and (iii) the number of optimizer steps is therefore matched once sequence length and batch size are fixed. The full experimental protocol, including these normalizations, is already detailed in Section 3; the abstract revision will simply surface the key equivalences without lengthening the summary excessively.
  Revision: yes
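The budget accounting described in points (i) to (iii) can be sketched in a few lines. This is an illustration only: the class and function names are hypothetical, and the batch size, sequence length, and token counts in the usage example are placeholder values, not the paper's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    batch_size: int  # sequences per optimizer step
    seq_len: int     # tokens per sequence

    @property
    def tokens_per_step(self) -> int:
        return self.batch_size * self.seq_len


def scratch_budget(setting: int, parent_pretrain_tokens: int, retrain_tokens: int) -> int:
    """Token budget given to the from-scratch model in each setting.

    Setting 1: the same post-pruning training tokens as the pruned model.
    Setting 2: the full pipeline, i.e. parent pretraining plus post-pruning tokens.
    """
    if setting == 1:
        return retrain_tokens
    if setting == 2:
        return parent_pretrain_tokens + retrain_tokens
    raise ValueError("setting must be 1 or 2")


def optimizer_steps(token_budget: int, cfg: RunConfig) -> int:
    """Whole optimizer steps that fit in the budget at fixed batch and sequence size."""
    return token_budget // cfg.tokens_per_step


if __name__ == "__main__":
    cfg = RunConfig(batch_size=1024, seq_len=2048)  # placeholder values
    for setting in (1, 2):
        budget = scratch_budget(setting, 200 * 10**9, 50 * 10**9)
        print(setting, budget, optimizer_steps(budget, cfg))
```

With batch size and sequence length fixed, equal token budgets imply equal step counts, which is the equivalence the referee asks the authors to state explicitly.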
Circularity Check
No circularity: purely empirical comparison with independent measurements
The paper conducts controlled experiments comparing pruned initializations against random initialization and full-pipeline token budgets, reporting held-out performance metrics. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the central claims; the token-matching conditions are stated as experimental controls rather than definitions, and results are falsifiable against external benchmarks. This is a standard empirical study with no load-bearing reductions to inputs by construction.