Pruning and Distilling Mixture-of-Experts into Dense Language Models
Pith reviewed 2026-06-29 12:36 UTC · model grok-4.3
The pith
Converting a trained Mixture-of-Experts model to a dense model by scoring, grouping and distilling experts outperforms pruning a dense model by 6.3 percentage points at matched size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A trained MoE can be converted into a standard dense model by scoring experts, selecting and grouping them, concatenating their parameters into a dense FFN and distilling from the MoE teacher; the resulting dense model exceeds the downstream accuracy of a dense model produced by pruning a dense teacher at the same parameter count.
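The concatenation step can be illustrated with a toy example. The sketch below assumes plain two-layer ReLU experts with no gating branch and no router weights, which is simpler than the FFNs in the models named on this page. Under those assumptions, stacking the selected experts' up-projection rows and down-projection columns yields one dense FFN whose output equals the unweighted sum of the experts' outputs. All function names are hypothetical and are not taken from the paper's code.

```python
# Toy illustration: concatenating experts into one dense FFN.
# Assumptions (not from the paper): ReLU experts, no gating branch, no router weights.
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(w_up, w_down, x):
    """w_up has shape hidden x d_model; w_down has shape d_model x hidden."""
    return matvec(w_down, relu(matvec(w_up, x)))

def concat_experts(experts):
    """Stack up-projections by rows and down-projections by columns."""
    w_up = [row for up, _ in experts for row in up]
    d_model = len(experts[0][1])
    w_down = [[w for _, down in experts for w in down[i]] for i in range(d_model)]
    return w_up, w_down

def random_expert(rng, d_model, hidden):
    up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(hidden)]
    down = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d_model)]
    return up, down

rng = random.Random(0)
d_model, hidden = 4, 3
experts = [random_expert(rng, d_model, hidden) for _ in range(2)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

dense_up, dense_down = concat_experts(experts)
dense_out = ffn(dense_up, dense_down, x)
# Reference: run each expert separately and add the outputs.
sum_out = [sum(vals) for vals in zip(*(ffn(up, down, x) for up, down in experts))]
max_abs_diff = max(abs(a - b) for a, b in zip(dense_out, sum_out))
```

An MoE layer weights each active expert by a router probability and activates only a few experts per token, so this unweighted sum is at best a starting point for the student. That gap is presumably what the magnitude scaling and distillation steps mentioned in the abstract are meant to close.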
What carries the argument
Diversity-aware scoring of experts, which ranks them to maximize coverage of distinct knowledge before grouping and distillation into a dense feed-forward network.
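One generic way to maximize coverage is greedy selection under a facility-location objective, a standard submodular technique. The sketch below illustrates that general idea only. It is not the paper's scoring rule, which this page does not specify, and the expert feature vectors and the cosine similarity measure are hypothetical.

```python
# Greedy facility-location selection: a generic diversity heuristic.
# NOT the paper's scoring method; features and similarity are illustrative assumptions.
import math

def cosine(u, v):
    # Assumes non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_diverse(features, k):
    """Pick k indices that greedily maximize sum_i max_{j in picked} sim(i, j)."""
    n = len(features)
    sim = [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]
    covered = [0.0] * n   # best similarity of each expert to the picked set so far
    picked = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for j in range(n):
            if j in picked:
                continue
            # Marginal gain: how much candidate j improves coverage of every expert.
            gain = sum(max(sim[i][j] - covered[i], 0.0) for i in range(n))
            if gain > best_gain:
                best, best_gain = j, gain
        picked.append(best)
        covered = [max(covered[i], sim[i][best]) for i in range(n)]
    return picked

# Hypothetical expert features: indices 0 and 1 are near-duplicates, 2 and 3 are distinct.
features = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
picked = greedy_diverse(features, 3)
```

On this toy input only one of the two near-duplicate vectors is selected, which is the behavior a coverage-oriented score is meant to produce: a redundant expert adds little marginal gain once its neighbor is already in the set.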
If this is right
- Frontier MoE models become usable in memory-limited settings without having to load every expert at inference time.
- At matched parameter count, MoE-to-dense distillation over ~4B tokens runs at 1.6x faster training wall-clock speed than dense-to-dense pruning.
- Among the tested design choices, the scoring method has the largest effect on accuracy, and the diversity-aware scoring outperforms prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite and GPT-OSS-20B.
- The conversion leaves a fully dense model whose inference cost and memory footprint match those of any standard transformer of the same width.
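The distillation step that these consequences depend on can be sketched as a temperature-softened loss in the style of Hinton, Vinyals, and Dean (2015). This page does not state which loss, temperature, or weighting the authors use, so the function names and the default temperature below are assumptions.

```python
# Temperature-softened distillation loss (generic sketch, not the paper's exact loss).
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher = [2.0, 0.5, -1.0]                 # hypothetical next-token logits
same = kd_loss(teacher, teacher)           # identical logits give zero loss
diff = kd_loss(teacher, [0.0, 0.0, 0.0])   # a uniform student gives a positive loss
```

In this setting the teacher logits would come from the MoE model and the student logits from the converted dense model, with the loss averaged over tokens in the distillation corpus.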
Where Pith is reading between the lines
- The same pipeline could let practitioners run large MoE checkpoints on edge devices by shipping only the distilled dense version.
- Task-specific re-scoring of experts after initial conversion might recover additional accuracy without retraining the entire student.
- If the diversity signal generalizes, future MoE training runs could deliberately encourage expert specialization knowing that excess experts can later be folded into dense layers.
Load-bearing premise
The expert selection and grouping choices identified on the three evaluated models will produce comparable gains on other MoE architectures or larger scales without additional hyper-parameter search.
What would settle it
Apply the same scoring-grouping-distillation pipeline to a fourth, previously unseen MoE model and observe that the final dense student underperforms a matched-size dense-to-dense pruned baseline by more than three points on the same downstream suite.
Original abstract
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
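The configuration count can be checked with simple arithmetic: 7 scoring, 5 grouping, and 2 scaling methods give 70 combinations, so 350 configurations implies 5 selected-expert counts, assuming a full grid. The abstract does not list the counts or the method names, so every label in the sketch below is a placeholder.

```python
# Enumerating a full design grid; all method names and expert counts are placeholders.
from itertools import product

scoring = [f"score_{i}" for i in range(7)]
grouping = [f"group_{i}" for i in range(5)]
scaling = [f"scale_{i}" for i in range(2)]
expert_counts = [f"count_{i}" for i in range(5)]   # 5 is inferred from 350 / (7 * 5 * 2)

configs = list(product(scoring, grouping, scaling, expert_counts))
print(len(configs))  # 350
```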
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a framework for converting trained MoE models to dense architectures: experts are scored (including a novel diversity-aware method), selected, grouped, concatenated into a dense FFN, and refined via knowledge distillation from the MoE teacher. Across 350 configurations on Qwen3-30B-A3B (plus evaluations on DeepSeek-V2-Lite and GPT-OSS-20B), scoring method is identified as most impactful. Under matched parameter count, MoE-to-dense yields +6.3 pp higher average downstream accuracy than dense-to-dense pruning after ~4B-token distillation, at 1.6x faster wall-clock training time.
Significance. If the comparison is properly controlled, the work offers a practical route to deploy MoE-derived capabilities in memory-constrained dense models. Evaluating 350 configurations gives substantial empirical coverage of the design choices, which is a strength for an applied compression study.
major comments (2)
- [Abstract] Abstract: the central claim of a '+6.3 pp' gain 'under a controlled comparison at matched parameter count' after '~4B-token distillation' does not state whether the dense-to-dense pruning baseline receives equivalent distillation from the MoE teacher for the same token budget. Without this, the delta cannot be unambiguously attributed to the expert scoring/grouping steps rather than differences in the distillation protocol.
- [Experiments] Experiments section (description of 350 configurations and downstream results): no error bars, standard deviations, or number of random seeds/runs are reported for any accuracy numbers, including the headline +6.3 pp delta. This undermines assessment of whether the reported gains exceed typical run-to-run variance in language-model fine-tuning.
minor comments (2)
- [Abstract] Abstract and experimental protocol: downstream task names, dataset splits, and evaluation settings are not listed, hindering direct reproduction of the accuracy numbers.
- The paper evaluates three MoE models but does not discuss whether the identified scoring/grouping hyperparameters transfer without retuning to other MoE families or larger scales.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and indicate the corresponding revisions.
- Referee: [Abstract] Abstract: the central claim of a '+6.3 pp' gain 'under a controlled comparison at matched parameter count' after '~4B-token distillation' does not state whether the dense-to-dense pruning baseline receives equivalent distillation from the MoE teacher for the same token budget. Without this, the delta cannot be unambiguously attributed to the expert scoring/grouping steps rather than differences in the distillation protocol.
  Authors: We confirm that the dense-to-dense pruning baseline receives identical knowledge distillation from the MoE teacher using the same ~4B-token budget, ensuring the comparison is controlled. The reported gain is therefore attributable to the MoE-to-dense pipeline. We will revise the abstract to state this explicitly.
  Revision: yes
- Referee: [Experiments] Experiments section (description of 350 configurations and downstream results): no error bars, standard deviations, or number of random seeds/runs are reported for any accuracy numbers, including the headline +6.3 pp delta. This undermines assessment of whether the reported gains exceed typical run-to-run variance in language-model fine-tuning.
  Authors: We agree that variance reporting would improve assessment of the results. The scale of 350 configurations made multiple seeds per run computationally prohibitive. In the revision we will add standard deviations from repeated runs on the primary configurations and the headline comparison.
  Revision: yes
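The variance reporting requested here could take the following form. The accuracy values are made-up placeholders chosen only so the code runs, not results from the paper, and the function name is hypothetical. The sketch also assumes that runs can be paired by seed, which the page does not confirm.

```python
# Mean and sample standard deviation of a per-seed accuracy gap.
# The numbers below are placeholders, NOT results from the paper.
from statistics import mean, stdev

def gap_summary(method_acc, baseline_acc):
    """Summarize paired per-seed differences (same seed index in both lists)."""
    if len(method_acc) != len(baseline_acc) or len(method_acc) < 2:
        raise ValueError("need at least two paired runs")
    gaps = [m - b for m, b in zip(method_acc, baseline_acc)]
    return mean(gaps), stdev(gaps)

moe_to_dense = [50.0, 52.0, 54.0]     # placeholder accuracies (%), one per seed
dense_to_dense = [49.0, 49.0, 49.0]   # placeholder accuracies (%), one per seed
gap_mean, gap_std = gap_summary(moe_to_dense, dense_to_dense)
print(f"{gap_mean:.1f} +/- {gap_std:.1f} pp")  # 3.0 +/- 2.0 pp
```

A gap is only convincing when it is large relative to this spread, which is the comparison the referee's comment asks the authors to make possible.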
Circularity Check
No circularity: purely empirical evaluation with no derivations or fitted predictions
The paper describes an empirical pipeline of scoring, grouping, concatenation, and distillation evaluated across 350 configurations on three models. No equations, uniqueness theorems, ansatzes, or predictions are presented that could reduce to inputs by construction. All reported gains (e.g., +6.3 pp) are direct experimental outcomes at matched parameter counts after fixed-token distillation. No self-citations are load-bearing for any central claim, and the work contains no mathematical derivation chain. This is a standard non-finding for an empirical methods paper.