arxiv: 2606.14150 · v2 · pith:QVMLCX7Cnew · submitted 2026-06-12 · 💻 cs.LG · cs.CL

Small LLMs: Pruning vs. Training from Scratch

Pith reviewed 2026-06-27 05:03 UTC · model grok-4.3

keywords LLM pruning · small language models · training from scratch · model compression · token budget · Llama-3.1 · structured pruning

The pith

Pruning a large LLM to create small ones outperforms training from scratch when the training token budget is limited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares pruning Llama-3.1-8B at 50-80 percent reduction ratios using six methods spanning depth, width and sparse granularities against training the resulting small models from random initialization. In one setting the training token count is matched between the two approaches; pruned initialization wins. In the second setting training from scratch receives the entire token count consumed by the full pruning-plus-retraining pipeline; finer-grained pruning still wins while coarser structured pruning loses its edge or is overtaken. The results support a practical rule: a large pretrained parent is useful mainly when the downstream training budget is constrained.
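To make the three granularities concrete, here is a minimal sketch on a toy model stored as plain Python lists. It is not any of the paper's six methods: the function names are hypothetical, and the magnitude criterion in `sparse_prune` is an assumption chosen for simplicity, since real methods may score weights differently.

```python
# Toy illustration of depth, width, and sparse pruning granularities.
# A "model" is a list of layers; each layer is a weight matrix (list of rows).
# All names are hypothetical; this is not the paper's implementation.

def depth_prune(layers, keep_indices):
    """Depth: drop whole layers, keeping only those at keep_indices."""
    return [layers[i] for i in keep_indices]


def width_prune(layer, keep_rows):
    """Width: drop whole rows (for example hidden units) of one matrix."""
    return [layer[r] for r in keep_rows]


def sparse_prune(layer, ratio):
    """Sparse: zero the smallest-magnitude weights; the shape is unchanged.

    Magnitude is an assumed importance score, used here only for illustration.
    """
    magnitudes = sorted(abs(w) for row in layer for w in row)
    n_zero = int(len(magnitudes) * ratio)
    if n_zero == 0:
        return [row[:] for row in layer]
    threshold = magnitudes[n_zero - 1]
    pruned, zeroed = [], 0
    for row in layer:
        new_row = []
        for w in row:
            # The counter guards against zeroing extra weights on ties.
            if abs(w) <= threshold and zeroed < n_zero:
                new_row.append(0.0)
                zeroed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned
```

Depth and width pruning shrink the model's shape, while sparse pruning keeps the shape and removes individual weights, which is what makes it the finest of the three granularities.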

Core claim

Under matched training token budgets, pruned initialization from the parent model consistently outperforms random initialization. When training from scratch is instead allotted the full token budget of the pruning pipeline, pruning at finer granularities retains an advantage while coarser structured pruning can be matched or surpassed. This indicates that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only when pruning operates at fine granularity.

What carries the argument

Token-matched experimental settings that isolate the effect of pruned initialization from a large parent versus random initialization across six pruning methods at ratios 0.5-0.8.
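The two token-matched settings reduce to simple budget arithmetic. The sketch below assumes the figure labels P200-R50, S50, and S250 denote token counts in billions (parent tokens, retraining tokens, and scratch tokens respectively); that reading, and every name in the code, is an assumption rather than something stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    """Token budget of a pruning pipeline, in billions of tokens (assumed unit)."""
    parent_tokens_b: int   # tokens spent training the parent before pruning
    retrain_tokens_b: int  # tokens spent retraining the pruned model


def scratch_budget_setting1(p: Pipeline) -> int:
    """Setting 1: scratch gets only the post-pruning retraining budget."""
    return p.retrain_tokens_b


def scratch_budget_setting2(p: Pipeline) -> int:
    """Setting 2: scratch gets the full budget consumed by the pipeline."""
    return p.parent_tokens_b + p.retrain_tokens_b


if __name__ == "__main__":
    p = Pipeline(parent_tokens_b=200, retrain_tokens_b=50)  # assumed reading of "P200-R50"
    print("setting 1 scratch budget (B tokens):", scratch_budget_setting1(p))
    print("setting 2 scratch budget (B tokens):", scratch_budget_setting2(p))
```

Under that reading, Setting 1 compares the pruned model against a 50B-token scratch run and Setting 2 against a 250B-token scratch run.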

If this is right

  • When training tokens are scarce, pruning a large parent is the stronger route to a small model.
  • When training tokens are abundant, training from scratch becomes competitive for coarse structured pruning.
  • The performance gap between pruned and random initialization narrows as the pruning ratio increases.
  • Knowledge transferred from the parent model cannot be fully recovered by additional training tokens alone when pruning is fine-grained.
  • A large pretrained parent is not always required if the practitioner can afford a large training budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • For extremely large future token budgets the marginal value of maintaining and pruning very large parent models may decline.
  • Hardware or data-center operators who can afford long training runs may be able to skip the cost of keeping oversized parent models, at least where the alternative is coarse structured pruning.
  • The granularity dependence suggests that hybrid pipelines mixing coarse structured pruning with later fine unstructured pruning deserve direct comparison.

Load-bearing premise

The token budgets are accurately matched between the pruning-plus-retraining pipeline and the pure from-scratch condition, and the six tested pruning methods are representative of the broader space of possible techniques.

What would settle it

A controlled replication in which training from scratch with the full pipeline token budget matches or exceeds every pruned model at every granularity and every ratio would falsify the retained advantage of fine-grained pruning.
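The falsification criterion can be written as a single predicate over a results table. This is a sketch under assumptions: the function name and the data layout are hypothetical, and scores are assumed to be higher-is-better (a perplexity-style metric would need its sign flipped first).

```python
def fine_grained_advantage_falsified(pruned_scores, scratch_scores):
    """Return True if scratch matches or exceeds every pruned model.

    pruned_scores:  {(method, ratio): score} for pruned-and-retrained models.
    scratch_scores: {ratio: score} for scratch models trained with the
                    full pipeline token budget.
    Scores are assumed to be higher-is-better.
    """
    if not pruned_scores:
        raise ValueError("need at least one pruned result to compare against")
    return all(
        scratch_scores[ratio] >= score
        for (_method, ratio), score in pruned_scores.items()
    )
```

A single pruned configuration that beats its scratch counterpart makes the predicate return False, which mirrors how strong the stated condition is: it must hold for every method at every ratio.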

Figures

Figures reproduced from arXiv: 2606.14150 by Jiachen Zhu, Kunjun Li, Mingjie Sun, Taiming Lu, Yufeng Xu, Zhuang Liu.

Figure 1: Initialization by pruning provides a strong advantage over random initialization, but this advantage diminishes as training continues. Left: under the same training token budget, pruning initialization beats random initialization, although the advantage decreases with longer training. Right: when the random initialization baseline is trained with the full token budget used by the entire pruning pipeline, i…
Figure 2: Pruning granularity and method overview.
Figure 3: Four structured pruning methods across retraining token budgets (Llama-3.1-8B…
Figure 4: S50 vs. P200-R50 across model sizes for depth and width pruning.
Figure 5: P200-R50 versus S250 across pruning methods.
Original abstract

Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5-0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper compares pruning Llama-3.1-8B at ratios 0.5-0.8 using six methods (depth, width, sparse) against training from scratch, under two controlled token-matched settings. Setting (1) uses the same post-pruning training tokens and finds pruned init outperforms random init, with the gap narrowing at higher budgets/ratios. Setting (2) gives scratch training the full pipeline token count and finds finer pruning retains advantage while coarser structured pruning can be matched or surpassed. The recommendation is that pruning is preferable with limited budgets but a large parent is not always necessary for coarser pruning with unlimited budgets.

Significance. If the token budgets are verifiably matched and the six methods representative, the work supplies practical guidance on when pretrained knowledge via pruning cannot be recovered by extra tokens alone. The two-setting design with held-out performance measurement is a positive feature for an empirical study in this area.

major comments (1)
  1. [Abstract] The central recommendation rests on the two settings being 'token-matched,' yet the abstract provides no explicit accounting of how pretraining tokens are included in the full pipeline budget, how epochs/sequence lengths are normalized across conditions, or whether batch sizes and optimizer steps are identical. This equivalence is load-bearing for the claim that pruning's advantage at limited budgets is due to initialization rather than budget mismatch.
minor comments (2)
  1. [Abstract] The six pruning methods are described only at the level of 'spanning depth, width, and sparse granularities' without names or citations; the main text should list them explicitly with references for reproducibility.
  2. [Abstract] Directional trends are reported without mention of error bars, number of runs, or statistical tests; these should be added to the results sections to allow assessment of the reported narrowing of advantages.
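The second minor comment asks for error bars and run counts. A minimal sketch of what that reporting could look like follows, using only the standard library; the helper names and the two-standard-error threshold are assumptions for illustration, not a procedure taken from the paper or the report.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample std, standard error) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)       # sample standard deviation (n - 1)
    sem = std / len(scores) ** 0.5       # standard error of the mean
    return mean, std, sem


def gap_is_clear(pruned_runs, scratch_runs, k=2.0):
    """Crude check: the mean gap exceeds k combined standard errors.

    The threshold k=2.0 is an assumed rule of thumb, not a formal test.
    """
    mean_p, _, sem_p = summarize_runs(pruned_runs)
    mean_s, _, sem_s = summarize_runs(scratch_runs)
    combined = (sem_p ** 2 + sem_s ** 2) ** 0.5
    return abs(mean_p - mean_s) > k * combined
```

Reporting the three summary values per condition would let a reader judge whether a narrowing gap is larger than run-to-run noise; a proper significance test would be the natural next step.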

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract's description of the token-matching procedure. The concern is well-taken, as precise budget accounting is central to the claims. We address the point below and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central recommendation rests on the two settings being 'token-matched,' yet the abstract provides no explicit accounting of how pretraining tokens are included in the full pipeline budget, how epochs/sequence lengths are normalized across conditions, or whether batch sizes and optimizer steps are identical. This equivalence is load-bearing for the claim that pruning's advantage at limited budgets is due to initialization rather than budget mismatch.

    Authors: We agree the abstract would benefit from greater precision on these points. In the revised version we will add a concise clause clarifying that (i) Setting (2) assigns the scratch model the sum of the parent model's original pretraining tokens plus the post-pruning training tokens, (ii) all runs use identical batch size, sequence length, and optimizer hyperparameters, and (iii) the number of optimizer steps is therefore matched once sequence length and batch size are fixed. The full experimental protocol, including these normalizations, is already detailed in Section 3; the abstract revision will simply surface the key equivalences without lengthening the summary excessively.

    revision: yes
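Point (iii) of the response follows from a one-line relation: with a fixed batch size and sequence length, each optimizer step consumes the same number of tokens, so equal token budgets imply equal step counts. The sketch below assumes batch size is counted in sequences and that every sequence is full length; the function name is hypothetical.

```python
def optimizer_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Number of full optimizer steps a token budget allows.

    Assumes batch_size is counted in sequences and every sequence
    contains exactly seq_len tokens (no padding, no packing effects).
    """
    tokens_per_step = batch_size * seq_len
    if tokens_per_step <= 0:
        raise ValueError("batch_size and seq_len must be positive")
    return total_tokens // tokens_per_step


def steps_are_matched(tokens_a: int, tokens_b: int,
                      batch_size: int, seq_len: int) -> bool:
    """True when two runs with shared batch settings take the same number of steps."""
    return (optimizer_steps(tokens_a, batch_size, seq_len)
            == optimizer_steps(tokens_b, batch_size, seq_len))
```

If either batch size or sequence length differed between conditions, matched tokens would no longer imply matched steps, which is why the referee asks for both to be stated explicitly.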

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with independent measurements

The paper conducts controlled experiments comparing pruned initializations against random initialization and full-pipeline token budgets, reporting held-out performance metrics. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive the central claims; the token-matching conditions are stated as experimental controls rather than definitions, and results are falsifiable against external benchmarks. This is a standard empirical study with no load-bearing reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning study; no mathematical derivations or new theoretical constructs. Relies only on standard assumptions of LLM training such as the validity of next-token prediction loss and the representativeness of the chosen evaluation metrics.


