pith. machine review for the scientific record. sign in

arxiv: 2604.19835 · v2 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Bing Yin, Binxuan Huang, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao

Pith reviewed 2026-05-12 01:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of expertsexpert upcyclingcontinued pre-trainingmodel scalingcompute efficiencysparse routinglanguage model trainingwarm initialization
0
0 comments X

The pith

Expert upcycling duplicates trained experts and continues pre-training to expand MoE capacity while preserving inference cost and cutting total training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes expert upcycling as a method to grow Mixture-of-Experts models by duplicating experts from a smaller trained checkpoint, extending the router, and then performing continued pre-training to induce specialization among the copies. This approach starts the larger model from a much lower loss than random initialization and keeps per-token computation fixed by holding top-K routing constant. Experiments at 7B-13B total parameters show the resulting model matches the validation loss of a baseline trained from scratch while using 32 percent fewer GPU hours. A utility-based selection rule that duplicates experts according to gradient importance scores further accelerates gap closure when continued pre-training time is limited. The method therefore offers a practical route to larger MoE models without paying the full cost of training them from random initialization.

Core claim

Given a trained E-expert model, the upcycling operator produces an mE-expert model by duplicating each expert m times and expanding the router accordingly, then runs continued pre-training that breaks the initial symmetry so the duplicated experts specialize. The quality gap between this warm-started model and one trained from scratch decomposes into a capacity term (addressed by the extra experts) and an initialization term (largely closed by the inherited representations). Utility-based expert selection, which ranks experts by gradient-based importance, more than triples the fraction of the gap closed under short continued pre-training budgets.

What carries the argument

The upcycling operator, which duplicates experts and extends the router while keeping top-K routing fixed, thereby providing a warm initialization whose symmetry is broken by subsequent continued pre-training.

If this is right

  • Upcycled models reach the same validation loss as fixed-size baselines trained from scratch.
  • Training compute measured in GPU hours drops by 32 percent in the 7B-to-13B regime.
  • Utility-based selection triples gap closure when continued pre-training is budget-limited.
  • The approach remains effective across different model scales, activation ratios, and MoE architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same duplication-plus-continued-training pattern could be applied to other sparse or modular architectures that currently require full retraining to increase capacity.
  • If the initialization term shrinks predictably with expert count, practitioners could grow models incrementally rather than committing to a single large random-initialization run.
  • The decomposition into capacity and initialization terms suggests a testable schedule: measure how much continued pre-training is needed for a given duplication factor before the gap saturates.

Load-bearing premise

Continued pre-training after expert duplication is sufficient to break symmetry and produce specialization comparable to training the larger model from random initialization.

What would settle it

An experiment in which the upcycled model, after the same total training budget as the from-scratch baseline, still shows a measurable gap in final validation loss that does not close even with extended continued pre-training.

Figures

Figures reproduced from arXiv: 2604.19835 by Bing Yin, Binxuan Huang, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Pratik Jayarao.

Figure 1
Figure 1. Figure 1: Overview of the expert upcycling procedure. Step 1: Pre-train an E-expert MoE for τ steps. Step 2: Apply the upcycling operator Um at step τ : each expert e is replicated re ≥ 1 times (high-utility experts receive more copies, rE > ri ≥ · · · ≥ r1, s.t. Pre = m · E), and the router is extended with replicated slots plus bias noise. All copies are identical at τ , providing a warm initialization. Step 3: Co… view at source ↗
Figure 2
Figure 2. Figure 2: Expert upcycling at 50% CPT on the 7B→13B interleaved MoE. Left: Upcycled (32→64) requires 27,888 GPU hours, saving 32% over Fixed-64 (41,328 hours) while using 32% more than Fixed-32 (21,168 hours). Center: Validation loss of Upcycled (1.305) is lower than Fixed-32 (1.339) and close to Fixed-64 (1.308). Right: Downstream benchmark accuracy on six representative tasks; Upcycled matches or exceeds Fixed-64 … view at source ↗
read the original abstract

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes expert upcycling for Mixture-of-Experts (MoE) models: given a trained E-expert checkpoint, duplicate experts and extend the router to create an mE-expert model while preserving top-K routing and per-token inference cost. Continued pre-training (CPT) is then used to break symmetry among duplicates. A theoretical framework decomposes the quality gap into capacity and initialization terms; utility-based expert selection (gradient importance scores) is introduced to guide non-uniform duplication. In 7B-13B total-parameter experiments the upcycled model matches a fixed-size baseline on validation loss while using 32% fewer GPU hours, with ablations across scales, activation ratios, architectures, and budgets.

Significance. If the empirical match holds under the stated decomposition, the method supplies a concrete, lower-cost route to larger MoE capacity by exploiting warm-start representations rather than random initialization. The reported 32% GPU-hour saving and the utility-selection ablation (tripling gap closure under limited CPT) would be practically useful for frontier-scale training.

major comments (2)
  1. [§3 and §5.1] §3 (theoretical framework) and §5.1 (7B-13B results): the decomposition of the quality gap into additive capacity and initialization terms presupposes that CPT after duplication induces expert specialization comparable to training the larger model from scratch. No direct diagnostic (pairwise expert cosine similarity, activation overlap, or gradient correlation after CPT) is reported to confirm that duplicated experts have in fact diverged; without it the observed loss match could be explained entirely by the warm-start benefit, undermining the claimed capacity-scaling efficiency.
  2. [§5.2] Table 1 / §5.2 (utility-based selection): the claim that gradient-based importance scores more than triple gap closure is load-bearing for the practical recipe. The paper should show that the selected experts are measurably more diverse post-CPT than uniform duplication (e.g., via the same diversity metric used in the decomposition), otherwise the improvement could be an artifact of the particular importance estimator rather than a general principle.
minor comments (2)
  1. [§4.3] §4.3 (experimental details): data-exclusion rules, validation-set construction, and whether the baseline and upcycled runs used identical token budgets after the upcycling point are not stated explicitly; these details are needed to interpret the 32% GPU-hour figure.
  2. [Figures 3-5] Figures 3-5: axis labels and legend entries use inconsistent abbreviations for “upcycled” vs. “baseline”; a single consistent notation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below and have made revisions to the manuscript to incorporate additional supporting analyses where appropriate.

read point-by-point responses
  1. Referee: [§3 and §5.1] §3 (theoretical framework) and §5.1 (7B-13B results): the decomposition of the quality gap into additive capacity and initialization terms presupposes that CPT after duplication induces expert specialization comparable to training the larger model from scratch. No direct diagnostic (pairwise expert cosine similarity, activation overlap, or gradient correlation after CPT) is reported to confirm that duplicated experts have in fact diverged; without it the observed loss match could be explained entirely by the warm-start benefit, undermining the claimed capacity-scaling efficiency.

    Authors: We appreciate this observation on the assumptions underlying our theoretical framework in §3. The decomposition separates the quality gap into initialization and capacity components based on the observed loss differences. While direct post-CPT diagnostics were not included in the original submission, the consistent matching of validation loss in our 7B-13B experiments, combined with ablations across scales and budgets, supports that specialization occurs during CPT. To provide explicit confirmation, we will add measurements of expert divergence (such as pairwise cosine similarities and activation overlap) after CPT in the revised manuscript. This will directly validate the capacity term and address the possibility that benefits are solely from warm-start. revision: yes

  2. Referee: [§5.2] Table 1 / §5.2 (utility-based selection): the claim that gradient-based importance scores more than triple gap closure is load-bearing for the practical recipe. The paper should show that the selected experts are measurably more diverse post-CPT than uniform duplication (e.g., via the same diversity metric used in the decomposition), otherwise the improvement could be an artifact of the particular importance estimator rather than a general principle.

    Authors: We agree that linking the utility-based selection to measurable increases in post-CPT diversity would strengthen the practical utility of the method. The empirical results in §5.2 and Table 1 show that utility selection leads to substantially better gap closure under limited CPT. To demonstrate this is due to greater diversity rather than estimator specifics, we will include in the revision a comparison of the diversity metric (from §3) for utility-selected vs. uniform duplication after CPT. This will confirm that the selected experts exhibit more specialization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent of any definitional decomposition

full rationale

The paper's central claim is an empirical efficiency result: an upcycled 7B-to-13B MoE matches a fixed-size baseline on validation loss while using 32% fewer GPU hours. This is supported by direct experiments and ablations across scales, activation ratios, and budgets rather than any derivation that reduces by the paper's equations to a fitted quantity or self-citation chain. The mentioned theoretical framework simply decomposes the observed quality gap into named capacity and initialization terms; this naming does not force the experimental outcome or substitute for the reported measurements. No load-bearing step equates a prediction to its own input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the theoretical framework is mentioned but not detailed enough to enumerate.

pith-pipeline@v0.9.0 · 5629 in / 1221 out tokens · 63609 ms · 2026-05-12T01:58:50.943510+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 7 internal anchors

  1. [1]

    Parameters vs FLOPs: Scaling laws for optimal sparsity for mixture-of- experts language models.arXiv preprint arXiv:2501.12370, 2025

    Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, and Vimal Thilak. Parameters vs FLOPs: Scaling laws for optimal sparsity for mixture-of- experts language models.arXiv preprint arXiv:2501.12370, 2025

  2. [2]

    Stacking as accelerated gradient descent, 2024

    Naman Agarwal, Pranjal Awasthi, Satyen Kale, and Eric Zhao. Stacking as accelerated gradient descent, 2024. URLhttps://arxiv.org/abs/2403.04978

  3. [3]

    Deep progressive training: scaling up depth capacity of zero/one-layer models, 2025

    Zhiqi Bu. Deep progressive training: scaling up depth capacity of zero/one-layer models, 2025. URLhttps://arxiv.org/abs/2511.04981

  4. [4]

    Convex dominance in deep learning I: A scaling law of loss and learning rate.arXiv preprint arXiv:2602.07145, 2026

    Zhiqi Bu, Shiyun Xu, and Jialin Mao. Convex dominance in deep learning I: A scaling law of loss and learning rate.arXiv preprint arXiv:2602.07145, 2026. Accepted to ICLR 2026

  5. [5]

    Net2net: Accelerating learning via knowl- edge transfer

    Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowl- edge transfer. InInternational Conference on Learning Representations (ICLR), 2016. URL https://arxiv.org/abs/1511.05641

  6. [6]

    On the representation collapse of sparse mixture of experts

    Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. On the representation collapse of sparse mixture of experts. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  7. [7]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  8. [8]

    Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V . Le, Yonghui Wu, Zhifeng Chen,...

  9. [9]

    Stacking your transformers: A closer look at model growth for efficient LLM pre-training

    Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, and Jie Fu. Stacking your transformers: A closer look at model growth for efficient LLM pre-training. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Revisiting MoE and dense speed-accuracy comparisons for LLM training

    Xianzhi Du, Tom Gunter, Xiang Kong, Mark Lee, Zirui Wang, Aonan Zhang, Nan Du, and Ruoming Pang. Revisiting MoE and dense speed-accuracy comparisons for LLM training. arXiv preprint arXiv:2405.15052, 2024

  11. [11]

    Towards an empirical understanding of moe design choices.arXiv preprint arXiv:2402.13089, 2024

    Dongyang Fan, Bettina Messmer, and Martin Jaggi. Towards an empirical understanding of moe design choices.arXiv preprint arXiv:2402.13089, 2024

  12. [12]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research (JMLR), 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research (JMLR), 23(120):1–39, 2022. URLhttps://jmlr.org/papers/v23/21-0998.html

  13. [13]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    GLM-4.5 Team. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.arXiv preprint arXiv:2508.06471, 2025. URLhttps://arxiv.org/abs/2508.06471

  14. [14]

    Nexus: Adaptive upcycling to efficiently pretrain mixture of experts

    Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, and Ahmet Üstün. Nexus: Adaptive upcycling to efficiently pretrain mixture of experts. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 24364–24381. Association for Computational Linguistics, 2025

  15. [15]

    Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort

    Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re)warm your model?arXiv preprint arXiv:2308.04014, 2023. 12

  16. [16]

    Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

  17. [17]

    Second order derivatives for network pruning: Optimal brain surgeon

    Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. InAdvances in Neural Information Processing Systems (NeurIPS), volume 5, pages 164–171, 1992

  18. [18]

    MIT Press, 2nd edition, 2016

    Elad Hazan.Introduction to Online Convex Optimization. MIT Press, 2nd edition, 2016

  19. [19]

    Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524, 2024

    Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524, 2024

  20. [20]

    Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. InProceedings of the IEEE International Conference on Computer Vision (ICCV), 2015

  21. [21]

    DeRS: Towards extremely efficient upcycled mixture-of-experts models.arXiv preprint arXiv:2503.01359, 2025

    Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, and Tao Chen. DeRS: Towards extremely efficient upcycled mixture-of-experts models.arXiv preprint arXiv:2503.01359, 2025. Accepted at CVPR 2025

  22. [22]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  23. [23]

    MegaScale-MoE:Large-ScaleCommunication- Efficient Training of Mixture-of-Experts Models in Production

    Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, and Xin Liu. MegaScale-MoE: Large-scale communication- efficient training of mixture-of-experts models in production.arXiv preprint arXiv:2505.11432, 2025

  24. [24]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

  25. [25]

    Sparse upcycling: Training mixture-of-experts from dense checkpoints

    Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints.arXiv preprint arXiv:2212.05055, 2022

  26. [26]

    arXiv preprint arXiv:2402.07871 , year=

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygó´ zd´ z, Piotr Sankowski, Marek Cygan, and Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

  27. [27]

    Optimal brain damage

    Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. InAdvances in Neural Information Processing Systems (NeurIPS), volume 2, pages 598–605, 1989

  28. [28]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with condi- tional computation and automatic sharding. InInternational Conference on Learning Represen- tations (ICLR), 2021. URLhttps://arxiv.org/abs/2006.16668

  29. [29]

    Datacomp- LM : In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardn...

  30. [30]

    Fishers for free? approximating the fisher information matrix by recycling the squared gradient accumulator

    Yu Xin Li, Felix Dangel, Derek Tam, and Colin Raffel. Fishers for free? approximating the fisher information matrix by recycling the squared gradient accumulator. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings of Machine Learning Research, pages 34252–34270. PMLR, 2025

  31. [31]

    Slimmoe: Structured compression of large moe models via expert slimming and distillation, 2025

    Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, and Tuo Zhao. Slimmoe: Structured compression of large moe models via expert slimming and distillation.arXiv preprint arXiv:2506.18349, 2025

  32. [32]

    Scaling laws for upcycling mixture-of-experts language models.arXiv preprint arXiv:2502.03009, 2025

    Seng Pei Liew, Takuya Kato, and Sho Takase. Scaling laws for upcycling mixture-of-experts language models.arXiv preprint arXiv:2502.03009, 2025

  33. [33]

    Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models.arXiv preprint arXiv:2402.14800, 2024

    Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. Not all experts are equal: Efficient expert pruning and skipping for mixture-of- experts large language models.arXiv preprint arXiv:2402.14800, 2024

  34. [34]

    Language Model Cascades: Token-Level Uncertainty and Beyond

    Jan Ludziejewski, Maciej Píoro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Mała´snicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miło´s, and Sebastian Jaszczur. Joint MoE scaling laws: Mixture of experts can be memory efficient. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings...

  35. [35]

    Llama 4: Natively multimodal foundation models

    Meta AI. Llama 4: Natively multimodal foundation models. https://github.com/ meta-llama/llama-models, 2025. Model card available at https://github.com/ meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md

  36. [36]

    Pruning convolutional neural networks for resource efficient inference

    Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. InInternational Conference on Learning Representations (ICLR), 2017

  37. [37]

    Kimi k2: Open agentic intelligence.Technical Report, 2025

    Moonshot-AI. Kimi k2: Open agentic intelligence.Technical Report, 2025. URL https: //moonshotai.github.io/Kimi-K2/

  38. [38]

    Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models.arXiv preprint arXiv:2305.16264, 2023

  39. [39]

    Drop-upcycling: Training sparse mixture of experts with partial re-initialization.arXiv preprint arXiv:2502.19261, 2025

    Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, and Jun Suzuki. Drop-upcycling: Training sparse mixture of experts with partial re-initialization.arXiv preprint arXiv:2502.19261, 2025

  40. [40]

    LSH-MoE: Communication-efficient MoE training via locality-sensitive hashing

    Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, and Bin Cui. LSH-MoE: Communication-efficient MoE training via locality-sensitive hashing. InAdvances in Neural Information Processing Systems, volume 37, 2024

  41. [41]

    Reusing pretrained models by multi-linear operators for efficient training

    Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, and Qun Liu. Reusing pretrained models by multi-linear operators for efficient training. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  42. [42]

    Efficient stagewise pretraining via progressive subnetworks.arXiv preprint arXiv:2402.05913, 2024

    Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank Reddi, Satyen Kale, and Sanjiv Kumar. Efficient stagewise pretraining via progressive subnetworks.arXiv preprint arXiv:2402.05913, 2024

  43. [43]

    Reuse, don’t retrain: A recipe for continued pretraining of language models,

    Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Reuse, don’t retrain: A recipe for continued pretraining of language models.arXiv preprint arXiv:2407.07263, 2024

  44. [44]

    Raposo, S

    David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models.arXiv preprint arXiv:2404.02258, 2024

  45. [45]

    The surprising agreement between convex optimization theory and learning-rate scheduling for large model training

    Fabian Schaipp, Aaron Defazio, Harsh Mehta, Konstantin Mishchenko, and Ahmed Khaled. The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. InProceedings of the 42nd International Conference on Machine Learning (ICML), 2025. 14

  46. [46]

    Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 2012

    Shai Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 2012

  47. [47]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  48. [48]

    Trade-offs of diagonal fisher information matrix estimators

    Alexander Soen and Ke Sun. Trade-offs of diagonal fisher information matrix estimators. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  49. [49]

    Unchosen experts can contribute too: Unleashing moe models’ power by self-contrast.Advances in Neural Information Processing Systems, 37:136897–136921, 2024a

    Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, and Xian Li. Branch-train-MiX: Mixing expert LLMs into a mixture-of-experts LLM.arXiv preprint arXiv:2403.07816, 2024

  50. [50]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  51. [51]

    arXiv preprint arXiv:2507.17702 , year=

    Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models.arXiv preprint arXiv:2507.17702, 2025

  52. [52]

    Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664,

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664, 2024

  53. [53]

    Symphony-moe: Harmonizing disparate pre-trained models into a coherent mixture-of-experts.Proceedings of the AAAI Conference on Artificial Intelligence, 2026

    Qi Wang, Hanyang Peng, and Yue Yu. Symphony-moe: Harmonizing disparate pre-trained models into a coherent mixture-of-experts.Proceedings of the AAAI Conference on Artificial Intelligence, 2026. arXiv preprint arXiv:2509.18542

  54. [54]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

  55. [55]

    Grove MoE: Towards efficient and superior MoE LLMs with adjugate experts.arXiv preprint arXiv:2508.07785, 2025

    Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, and Jianguo Li. Grove MoE: Towards efficient and superior MoE LLMs with adjugate experts.arXiv preprint arXiv:2508.07785, 2025

  56. [56]

    SPARKLING: Balancing signal preservation and symmetry breaking for width-progressive learning.arXiv preprint arXiv:2602.02472, 2026

    Qifan Yu, Xinyu Ma, Zhijian Zhuo, Minrui Wang, Deyi Liu, Shiyi Zhan, Yiyuan Ma, Liang Xiang, Xingyan Bin, and Di He. SPARKLING: Balancing signal preservation and symmetry breaking for width-progressive learning.arXiv preprint arXiv:2602.02472, 2026

  57. [57]

    Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning,

    Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermi¸ s, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning.arXiv preprint arXiv:2309.05444, 2023

  58. [58]

    BAM! just like that: Simple and efficient parameter upcycling for mixture of experts.arXiv preprint arXiv:2408.08274, 2024

    Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, and Acyr Locatelli. BAM! just like that: Simple and efficient parameter upcycling for mixture of experts.arXiv preprint arXiv:2408.08274, 2024

  59. [59]

    empirically convex-like

    Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928–936, 2003. 15 Appendix Contents Proof of Progressive Training Bound A Theoretical Justification for Gradient-Based Utility Scores B Model Configurations C Heuristic Upcycling M...