arxiv: 2604.19835 · v2 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Bing Yin, Binxuan Huang, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao

Pith reviewed 2026-05-12 01:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords mixture of expertsexpert upcyclingcontinued pre-trainingmodel scalingcompute efficiencysparse routinglanguage model trainingwarm initialization

0 comments

The pith

Expert upcycling duplicates trained experts and continues pre-training to expand MoE capacity while preserving inference cost and cutting total training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes expert upcycling as a method to grow Mixture-of-Experts models by duplicating experts from a smaller trained checkpoint, extending the router, and then performing continued pre-training to induce specialization among the copies. This approach starts the larger model from a much lower loss than random initialization and keeps per-token computation fixed by holding top-K routing constant. Experiments at 7B-13B total parameters show the resulting model matches the validation loss of a baseline trained from scratch while using 32 percent fewer GPU hours. A utility-based selection rule that duplicates experts according to gradient importance scores further accelerates gap closure when continued pre-training time is limited. The method therefore offers a practical route to larger MoE models without paying the full cost of training them from random initialization.

Core claim

Given a trained E-expert model, the upcycling operator produces an mE-expert model by duplicating each expert m times and expanding the router accordingly, then runs continued pre-training that breaks the initial symmetry so the duplicated experts specialize. The quality gap between this warm-started model and one trained from scratch decomposes into a capacity term (addressed by the extra experts) and an initialization term (largely closed by the inherited representations). Utility-based expert selection, which ranks experts by gradient-based importance, more than triples the fraction of the gap closed under short continued pre-training budgets.

What carries the argument

The upcycling operator, which duplicates experts and extends the router while keeping top-K routing fixed, thereby providing a warm initialization whose symmetry is broken by subsequent continued pre-training.

If this is right

Upcycled models reach the same validation loss as fixed-size baselines trained from scratch.
Training compute measured in GPU hours drops by 32 percent in the 7B-to-13B regime.
Utility-based selection triples gap closure when continued pre-training is budget-limited.
The approach remains effective across different model scales, activation ratios, and MoE architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same duplication-plus-continued-training pattern could be applied to other sparse or modular architectures that currently require full retraining to increase capacity.
If the initialization term shrinks predictably with expert count, practitioners could grow models incrementally rather than committing to a single large random-initialization run.
The decomposition into capacity and initialization terms suggests a testable schedule: measure how much continued pre-training is needed for a given duplication factor before the gap saturates.

Load-bearing premise

Continued pre-training after expert duplication is sufficient to break symmetry and produce specialization comparable to training the larger model from random initialization.

What would settle it

An experiment in which the upcycled model, after the same total training budget as the from-scratch baseline, still shows a measurable gap in final validation loss that does not close even with extended continued pre-training.

Figures

Figures reproduced from arXiv: 2604.19835 by Bing Yin, Binxuan Huang, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao.

**Figure 1.** Figure 1: Overview of the expert upcycling procedure. Step 1: Pre-train an E-expert MoE for τ steps. Step 2: Apply the upcycling operator Um at step τ : each expert e is replicated re ≥ 1 times (high-utility experts receive more copies, rE > ri ≥ · · · ≥ r1, s.t. Pre = m · E), and the router is extended with replicated slots plus bias noise. All copies are identical at τ , providing a warm initialization. Step 3: Co… view at source ↗

**Figure 2.** Figure 2: Expert upcycling at 50% CPT on the 7B→13B interleaved MoE. Left: Upcycled (32→64) requires 27,888 GPU hours, saving 32% over Fixed-64 (41,328 hours) while using 32% more than Fixed-32 (21,168 hours). Center: Validation loss of Upcycled (1.305) is lower than Fixed-32 (1.339) and close to Fixed-64 (1.308). Right: Downstream benchmark accuracy on six representative tasks; Upcycled matches or exceeds Fixed-64 … view at source ↗

read the original abstract

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Expert duplication plus continued pre-training matches a larger MoE's validation loss at 32% lower GPU cost in the reported 7B-13B runs, but the evidence that duplicated experts actually specialize remains indirect.

read the letter

The main point is that this paper gives a concrete operator for expanding MoE capacity mid-training: duplicate experts from a trained checkpoint, extend the router, keep top-K routing the same, and run continued pre-training. They add a gradient-based utility score to pick which experts to duplicate more often. In the 7B-to-13B experiments the upcycled model reaches the same validation loss as training the larger model from scratch while using 32% fewer GPU hours, and they supply ablations over scales, activation ratios, and budgets plus a practical recipe.

Referee Report

2 major / 2 minor

Summary. The paper proposes expert upcycling for Mixture-of-Experts (MoE) models: given a trained E-expert checkpoint, duplicate experts and extend the router to create an mE-expert model while preserving top-K routing and per-token inference cost. Continued pre-training (CPT) is then used to break symmetry among duplicates. A theoretical framework decomposes the quality gap into capacity and initialization terms; utility-based expert selection (gradient importance scores) is introduced to guide non-uniform duplication. In 7B-13B total-parameter experiments the upcycled model matches a fixed-size baseline on validation loss while using 32% fewer GPU hours, with ablations across scales, activation ratios, architectures, and budgets.

Significance. If the empirical match holds under the stated decomposition, the method supplies a concrete, lower-cost route to larger MoE capacity by exploiting warm-start representations rather than random initialization. The reported 32% GPU-hour saving and the utility-selection ablation (tripling gap closure under limited CPT) would be practically useful for frontier-scale training.

major comments (2)

[§3 and §5.1] §3 (theoretical framework) and §5.1 (7B-13B results): the decomposition of the quality gap into additive capacity and initialization terms presupposes that CPT after duplication induces expert specialization comparable to training the larger model from scratch. No direct diagnostic (pairwise expert cosine similarity, activation overlap, or gradient correlation after CPT) is reported to confirm that duplicated experts have in fact diverged; without it the observed loss match could be explained entirely by the warm-start benefit, undermining the claimed capacity-scaling efficiency.
[§5.2] Table 1 / §5.2 (utility-based selection): the claim that gradient-based importance scores more than triple gap closure is load-bearing for the practical recipe. The paper should show that the selected experts are measurably more diverse post-CPT than uniform duplication (e.g., via the same diversity metric used in the decomposition), otherwise the improvement could be an artifact of the particular importance estimator rather than a general principle.

minor comments (2)

[§4.3] §4.3 (experimental details): data-exclusion rules, validation-set construction, and whether the baseline and upcycled runs used identical token budgets after the upcycling point are not stated explicitly; these details are needed to interpret the 32% GPU-hour figure.
[Figures 3-5] Figures 3-5: axis labels and legend entries use inconsistent abbreviations for “upcycled” vs. “baseline”; a single consistent notation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below and have made revisions to the manuscript to incorporate additional supporting analyses where appropriate.

read point-by-point responses

Referee: [§3 and §5.1] §3 (theoretical framework) and §5.1 (7B-13B results): the decomposition of the quality gap into additive capacity and initialization terms presupposes that CPT after duplication induces expert specialization comparable to training the larger model from scratch. No direct diagnostic (pairwise expert cosine similarity, activation overlap, or gradient correlation after CPT) is reported to confirm that duplicated experts have in fact diverged; without it the observed loss match could be explained entirely by the warm-start benefit, undermining the claimed capacity-scaling efficiency.

Authors: We appreciate this observation on the assumptions underlying our theoretical framework in §3. The decomposition separates the quality gap into initialization and capacity components based on the observed loss differences. While direct post-CPT diagnostics were not included in the original submission, the consistent matching of validation loss in our 7B-13B experiments, combined with ablations across scales and budgets, supports that specialization occurs during CPT. To provide explicit confirmation, we will add measurements of expert divergence (such as pairwise cosine similarities and activation overlap) after CPT in the revised manuscript. This will directly validate the capacity term and address the possibility that benefits are solely from warm-start. revision: yes
Referee: [§5.2] Table 1 / §5.2 (utility-based selection): the claim that gradient-based importance scores more than triple gap closure is load-bearing for the practical recipe. The paper should show that the selected experts are measurably more diverse post-CPT than uniform duplication (e.g., via the same diversity metric used in the decomposition), otherwise the improvement could be an artifact of the particular importance estimator rather than a general principle.

Authors: We agree that linking the utility-based selection to measurable increases in post-CPT diversity would strengthen the practical utility of the method. The empirical results in §5.2 and Table 1 show that utility selection leads to substantially better gap closure under limited CPT. To demonstrate this is due to greater diversity rather than estimator specifics, we will include in the revision a comparison of the diversity metric (from §3) for utility-selected vs. uniform duplication after CPT. This will confirm that the selected experts exhibit more specialization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent of any definitional decomposition

full rationale

The paper's central claim is an empirical efficiency result: an upcycled 7B-to-13B MoE matches a fixed-size baseline on validation loss while using 32% fewer GPU hours. This is supported by direct experiments and ablations across scales, activation ratios, and budgets rather than any derivation that reduces by the paper's equations to a fitted quantity or self-citation chain. The mentioned theoretical framework simply decomposes the observed quality gap into named capacity and initialization terms; this naming does not force the experimental outcome or substitute for the reported measurements. No load-bearing step equates a prediction to its own input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the theoretical framework is mentioned but not detailed enough to enumerate.

pith-pipeline@v0.9.0 · 5629 in / 1221 out tokens · 63609 ms · 2026-05-12T01:58:50.943510+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Theorem 3.1 (Expert upcycling bound) ... decomposes the quality gap into a capacity term and an initialization term
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear
utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 7 internal anchors

[1]

Parameters vs FLOPs: Scaling laws for optimal sparsity for mixture-of- experts language models.arXiv preprint arXiv:2501.12370, 2025

Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, and Vimal Thilak. Parameters vs FLOPs: Scaling laws for optimal sparsity for mixture-of- experts language models.arXiv preprint arXiv:2501.12370, 2025

work page arXiv 2025
[2]

Stacking as accelerated gradient descent, 2024

Naman Agarwal, Pranjal Awasthi, Satyen Kale, and Eric Zhao. Stacking as accelerated gradient descent, 2024. URLhttps://arxiv.org/abs/2403.04978

work page arXiv 2024
[3]

Deep progressive training: scaling up depth capacity of zero/one-layer models, 2025

Zhiqi Bu. Deep progressive training: scaling up depth capacity of zero/one-layer models, 2025. URLhttps://arxiv.org/abs/2511.04981

work page arXiv 2025
[4]

Convex dominance in deep learning I: A scaling law of loss and learning rate.arXiv preprint arXiv:2602.07145, 2026

Zhiqi Bu, Shiyun Xu, and Jialin Mao. Convex dominance in deep learning I: A scaling law of loss and learning rate.arXiv preprint arXiv:2602.07145, 2026. Accepted to ICLR 2026

work page arXiv 2026
[5]

Net2net: Accelerating learning via knowl- edge transfer

Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowl- edge transfer. InInternational Conference on Learning Representations (ICLR), 2016. URL https://arxiv.org/abs/1511.05641

work page arXiv 2016
[6]

On the representation collapse of sparse mixture of experts

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. On the representation collapse of sparse mixture of experts. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[7]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V . Le, Yonghui Wu, Zhifeng Chen,...

work page arXiv 2022
[9]

Stacking your transformers: A closer look at model growth for efficient LLM pre-training

Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, and Jie Fu. Stacking your transformers: A closer look at model growth for efficient LLM pre-training. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[10]

Revisiting MoE and dense speed-accuracy comparisons for LLM training

Xianzhi Du, Tom Gunter, Xiang Kong, Mark Lee, Zirui Wang, Aonan Zhang, Nan Du, and Ruoming Pang. Revisiting MoE and dense speed-accuracy comparisons for LLM training. arXiv preprint arXiv:2405.15052, 2024

work page arXiv 2024
[11]

Towards an empirical understanding of moe design choices.arXiv preprint arXiv:2402.13089, 2024

Dongyang Fan, Bettina Messmer, and Martin Jaggi. Towards an empirical understanding of moe design choices.arXiv preprint arXiv:2402.13089, 2024

work page arXiv 2024
[12]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research (JMLR), 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research (JMLR), 23(120):1–39, 2022. URLhttps://jmlr.org/papers/v23/21-0998.html

work page 2022
[13]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5 Team. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.arXiv preprint arXiv:2508.06471, 2025. URLhttps://arxiv.org/abs/2508.06471

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Nexus: Adaptive upcycling to efficiently pretrain mixture of experts

Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, and Ahmet Üstün. Nexus: Adaptive upcycling to efficiently pretrain mixture of experts. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 24364–24381. Association for Computational Linguistics, 2025

work page 2025
[15]

Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort

Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re)warm your model?arXiv preprint arXiv:2308.04014, 2023. 12

work page arXiv 2023
[16]

Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

work page 2015
[17]

Second order derivatives for network pruning: Optimal brain surgeon

Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. InAdvances in Neural Information Processing Systems (NeurIPS), volume 5, pages 164–171, 1992

work page 1992
[18]

MIT Press, 2nd edition, 2016

Elad Hazan.Introduction to Online Convex Optimization. MIT Press, 2nd edition, 2016

work page 2016
[19]

Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524, 2024

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524, 2024

work page arXiv 2024
[20]

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2015

work page 2015
[21]

DeRS: Towards extremely efficient upcycled mixture-of-experts models.arXiv preprint arXiv:2503.01359, 2025

Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, and Tao Chen. DeRS: Towards extremely efficient upcycled mixture-of-experts models.arXiv preprint arXiv:2503.01359, 2025. Accepted at CVPR 2025

work page arXiv 2025
[22]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

MegaScale-MoE:Large-ScaleCommunication- Efficient Training of Mixture-of-Experts Models in Production

Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, and Xin Liu. MegaScale-MoE: Large-scale communication- efficient training of mixture-of-experts models in production.arXiv preprint arXiv:2505.11432, 2025

work page arXiv 2025
[24]

Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

work page 2017
[25]

Sparse upcycling: Training mixture-of-experts from dense checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints.arXiv preprint arXiv:2212.05055, 2022

work page arXiv 2022
[26]

arXiv preprint arXiv:2402.07871 , year=

Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygó´ zd´ z, Piotr Sankowski, Marek Cygan, and Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

work page arXiv 2024
[27]

Optimal brain damage

Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. InAdvances in Neural Information Processing Systems (NeurIPS), volume 2, pages 598–605, 1989

work page 1989
[28]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with condi- tional computation and automatic sharding. InInternational Conference on Learning Represen- tations (ICLR), 2021. URLhttps://arxiv.org/abs/2006.16668

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

Datacomp- LM : In search of the next generation of training sets for language models

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

work page arXiv 2024
[30]

Fishers for free? approximating the fisher information matrix by recycling the squared gradient accumulator

Yu Xin Li, Felix Dangel, Derek Tam, and Colin Raffel. Fishers for free? approximating the fisher information matrix by recycling the squared gradient accumulator. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings of Machine Learning Research, pages 34252–34270. PMLR, 2025

work page 2025
[31]

Slimmoe: Structured compression of large moe models via expert slimming and distillation, 2025

Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, and Tuo Zhao. Slimmoe: Structured compression of large moe models via expert slimming and distillation.arXiv preprint arXiv:2506.18349, 2025

work page arXiv 2025
[32]

Scaling laws for upcycling mixture-of-experts language models.arXiv preprint arXiv:2502.03009, 2025

Seng Pei Liew, Takuya Kato, and Sho Takase. Scaling laws for upcycling mixture-of-experts language models.arXiv preprint arXiv:2502.03009, 2025

work page arXiv 2025
[33]

Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models.arXiv preprint arXiv:2402.14800, 2024

Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. Not all experts are equal: Efficient expert pruning and skipping for mixture-of- experts large language models.arXiv preprint arXiv:2402.14800, 2024

work page arXiv 2024
[34]

Language Model Cascades: Token-Level Uncertainty and Beyond

Jan Ludziejewski, Maciej Píoro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Mała´snicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miło´s, and Sebastian Jaszczur. Joint MoE scaling laws: Mixture of experts can be memory efficient. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings...

work page internal anchor Pith review doi:10.48550/arxiv 2025
[35]

Llama 4: Natively multimodal foundation models

Meta AI. Llama 4: Natively multimodal foundation models. https://github.com/ meta-llama/llama-models, 2025. Model card available at https://github.com/ meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md

work page 2025
[36]

Pruning convolutional neural networks for resource efficient inference

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. InInternational Conference on Learning Representations (ICLR), 2017

work page 2017
[37]

Kimi k2: Open agentic intelligence.Technical Report, 2025

Moonshot-AI. Kimi k2: Open agentic intelligence.Technical Report, 2025. URL https: //moonshotai.github.io/Kimi-K2/

work page 2025
[38]

Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models.arXiv preprint arXiv:2305.16264, 2023

work page arXiv 2023
[39]

Drop-upcycling: Training sparse mixture of experts with partial re-initialization.arXiv preprint arXiv:2502.19261, 2025

Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, and Jun Suzuki. Drop-upcycling: Training sparse mixture of experts with partial re-initialization.arXiv preprint arXiv:2502.19261, 2025

work page arXiv 2025
[40]

LSH-MoE: Communication-efficient MoE training via locality-sensitive hashing

Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, and Bin Cui. LSH-MoE: Communication-efficient MoE training via locality-sensitive hashing. InAdvances in Neural Information Processing Systems, volume 37, 2024

work page 2024
[41]

Reusing pretrained models by multi-linear operators for efficient training

Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, and Qun Liu. Reusing pretrained models by multi-linear operators for efficient training. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[42]

Efficient stagewise pretraining via progressive subnetworks.arXiv preprint arXiv:2402.05913, 2024

Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, and Sanjiv Kumar. Efficient stagewise pretraining via progressive subnetworks.arXiv preprint arXiv:2402.05913, 2024

work page arXiv 2024
[43]

Reuse, don’t retrain: A recipe for continued pretraining of language models,

Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Reuse, don’t retrain: A recipe for continued pretraining of language models.arXiv preprint arXiv:2407.07263, 2024

work page arXiv 2024
[44]

Raposo, S

David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models.arXiv preprint arXiv:2404.02258, 2024

work page arXiv 2024
[45]

The surprising agreement between convex optimization theory and learning-rate scheduling for large model training

Fabian Schaipp, Aaron Defazio, Harsh Mehta, Konstantin Mishchenko, and Ahmed Khaled. The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025. 14

work page 2025
[46]

Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 2012

Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 2012

work page 2012
[47]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

Trade-offs of diagonal fisher information matrix estimators

Alexander Soen and Ke Sun. Trade-offs of diagonal fisher information matrix estimators. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[49]

Unchosen experts can contribute too: Unleashing moe models’ power by self-contrast.Advances in Neural Information Processing Systems, 37:136897–136921, 2024a

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, and Xian Li. Branch-train-MiX: Mixing expert LLMs into a mixture-of-experts LLM.arXiv preprint arXiv:2403.07816, 2024

work page arXiv 2024
[50]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

arXiv preprint arXiv:2507.17702 , year=

Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models.arXiv preprint arXiv:2507.17702, 2025

work page arXiv 2025
[52]

Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664,

Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664, 2024

work page arXiv 2024
[53]

Symphony-moe: Harmonizing disparate pre-trained models into a coherent mixture-of-experts.Proceedings of the AAAI Conference on Artificial Intelligence, 2026

Qi Wang, Hanyang Peng, and Yue Yu. Symphony-moe: Harmonizing disparate pre-trained models into a coherent mixture-of-experts.Proceedings of the AAAI Conference on Artificial Intelligence, 2026. arXiv preprint arXiv:2509.18542

work page arXiv 2026
[54]

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

work page 2022
[55]

Grove MoE: Towards efficient and superior MoE LLMs with adjugate experts.arXiv preprint arXiv:2508.07785, 2025

Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, and Jianguo Li. Grove MoE: Towards efficient and superior MoE LLMs with adjugate experts.arXiv preprint arXiv:2508.07785, 2025

work page arXiv 2025
[56]

SPARKLING: Balancing signal preservation and symmetry breaking for width-progressive learning.arXiv preprint arXiv:2602.02472, 2026

Qifan Yu, Xinyu Ma, Zhijian Zhuo, Minrui Wang, Deyi Liu, Shiyi Zhan, Yiyuan Ma, Liang Xiang, Xingyan Bin, and Di He. SPARKLING: Balancing signal preservation and symmetry breaking for width-progressive learning.arXiv preprint arXiv:2602.02472, 2026

work page arXiv 2026
[57]

Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning,

Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermi¸ s, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning.arXiv preprint arXiv:2309.05444, 2023

work page arXiv 2023
[58]

BAM! just like that: Simple and efficient parameter upcycling for mixture of experts.arXiv preprint arXiv:2408.08274, 2024

Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, and Acyr Locatelli. BAM! just like that: Simple and efficient parameter upcycling for mixture of experts.arXiv preprint arXiv:2408.08274, 2024

work page arXiv 2024
[59]

empirically convex-like

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, 2003. 15 Appendix Contents Proof of Progressive Training Bound A Theoretical Justification for Gradient-Based Utility Scores B Model Configurations C Heuristic Upcycling M...

work page 2003