Recognition: 2 Lean theorem links
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Pith reviewed 2026-05-12 01:58 UTC · model grok-4.3
The pith
Expert upcycling duplicates trained experts and continues pre-training to expand MoE capacity while preserving inference cost and cutting total training compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a trained E-expert model, the upcycling operator produces an mE-expert model by duplicating each expert m times and expanding the router accordingly, then runs continued pre-training that breaks the initial symmetry so the duplicated experts specialize. The quality gap between this warm-started model and one trained from scratch decomposes into a capacity term (addressed by the extra experts) and an initialization term (largely closed by the inherited representations). Utility-based expert selection, which ranks experts by gradient-based importance, more than triples the fraction of the gap closed under short continued pre-training budgets.
What carries the argument
The upcycling operator, which duplicates experts and extends the router while keeping top-K routing fixed, thereby providing a warm initialization whose symmetry is broken by subsequent continued pre-training.
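The operator described above is simple enough to sketch. The function below is illustrative, not the paper's code: it duplicates each expert's weights, tiles the matching router rows so top-K routing is unchanged, and adds a tiny perturbation so the copies are not exactly symmetric. The noise term is an extra assumption of this sketch; the paper relies on continued pre-training itself to break symmetry.

```python
import numpy as np

def upcycle(experts, router_w, m, noise=1e-3, rng=None):
    """Sketch of the upcycling operator (names are illustrative).

    experts:  list of E expert weight arrays.
    router_w: (E, d) routing matrix, one row per expert.
    Returns m*E experts and an (m*E, d) router; each source expert's
    row block stays aligned with its m duplicates.
    """
    rng = np.random.default_rng(rng)
    # Duplicate each expert m times, preserving order: [w0, w0, w1, w1, ...]
    new_experts = [w.copy() for w in experts for _ in range(m)]
    # Repeat each router row m times to match the duplicated experts.
    new_router = np.repeat(router_w, m, axis=0)
    # Tiny perturbation so duplicates are not exact mirrors; an assumption
    # here, since in the paper CPT alone is what drives specialization.
    new_router += noise * rng.standard_normal(new_router.shape)
    return new_experts, new_router
```

Per-token inference cost is preserved because top-K routing still activates the same number of experts; only the router's candidate set grows from E to mE.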
If this is right
- Upcycled models reach the same validation loss as fixed-size baselines trained from scratch.
- Training compute measured in GPU hours drops by 32 percent in the 7B-to-13B regime.
- Utility-based selection triples gap closure when continued pre-training is budget-limited.
- The approach remains effective across different model scales, activation ratios, and MoE architectures.
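The utility-driven selection in the third bullet can be sketched under stated assumptions: a first-order |w · ∂L/∂w| saliency as the per-expert importance score (one common gradient-based estimator; the paper's exact score may differ) and a greedy allocation of the extra duplicate slots to the highest-scoring experts.

```python
import numpy as np

def expert_utilities(expert_weights, expert_grads):
    """One plausible gradient-based importance score per expert:
    first-order saliency sum(|w * dL/dw|). Illustrative stand-in for
    the paper's utility score, not its definition."""
    return np.array([np.abs(w * g).sum()
                     for w, g in zip(expert_weights, expert_grads)])

def duplication_counts(utilities, extra_copies):
    """Allocate extra duplicates non-uniformly: every expert keeps one
    copy, and the `extra_copies` additional slots go to the
    highest-utility experts first (round-robin if the budget exceeds
    the expert count)."""
    counts = np.ones(len(utilities), dtype=int)
    order = np.argsort(utilities)[::-1]  # highest utility first
    for i in range(extra_copies):
        counts[order[i % len(utilities)]] += 1
    return counts
```

With utilities `[3, 1, 2, 0]` and a budget of two extra copies, experts 0 and 2 each get a second duplicate, so the counts come out `[2, 1, 2, 1]`.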
Where Pith is reading between the lines
- The same duplication-plus-continued-training pattern could be applied to other sparse or modular architectures that currently require full retraining to increase capacity.
- If the initialization term shrinks predictably with expert count, practitioners could grow models incrementally rather than committing to a single large random-initialization run.
- The decomposition into capacity and initialization terms suggests a testable schedule: measure how much continued pre-training is needed for a given duplication factor before the gap saturates.
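The schedule in the last bullet reduces to tracking one ratio as the CPT budget grows. The helper below uses hypothetical loss values; the idea is just that the fraction of the source-to-scratch gap closed by the upcycled model should saturate once the initialization term is exhausted and only the capacity term remains.

```python
def gap_closure(loss_upcycled, loss_source, loss_scratch):
    """Fraction of the source-to-scratch quality gap closed by the
    upcycled model at some point during CPT. 1.0 means the upcycled
    model has fully matched the from-scratch baseline. Loss values
    passed in are hypothetical, for illustration only."""
    total_gap = loss_source - loss_scratch    # gap the larger model should close
    remaining = loss_upcycled - loss_scratch  # part upcycling has not yet closed
    return 1.0 - remaining / total_gap

# Hypothetical numbers: source-model loss 2.30, from-scratch larger
# baseline 2.10, upcycled model after a short CPT budget 2.14.
print(gap_closure(2.14, 2.30, 2.10))  # ≈ 0.80: roughly 80% of the gap closed
```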
Load-bearing premise
Continued pre-training after expert duplication is sufficient to break symmetry and produce specialization comparable to training the larger model from random initialization.
What would settle it
An experiment in which the upcycled model, after the same total training budget as the from-scratch baseline, still shows a measurable gap in final validation loss that does not close even with extended continued pre-training.
Original abstract
Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes expert upcycling for Mixture-of-Experts (MoE) models: given a trained E-expert checkpoint, duplicate experts and extend the router to create an mE-expert model while preserving top-K routing and per-token inference cost. Continued pre-training (CPT) is then used to break symmetry among duplicates. A theoretical framework decomposes the quality gap into capacity and initialization terms; utility-based expert selection (gradient importance scores) is introduced to guide non-uniform duplication. In 7B-13B total-parameter experiments the upcycled model matches a fixed-size baseline on validation loss while using 32% fewer GPU hours, with ablations across scales, activation ratios, architectures, and budgets.
Significance. If the empirical match holds under the stated decomposition, the method supplies a concrete, lower-cost route to larger MoE capacity by exploiting warm-start representations rather than random initialization. The reported 32% GPU-hour saving and the utility-selection ablation (tripling gap closure under limited CPT) would be practically useful for frontier-scale training.
Major comments (2)
- [§3 and §5.1] §3 (theoretical framework) and §5.1 (7B-13B results): the decomposition of the quality gap into additive capacity and initialization terms presupposes that CPT after duplication induces expert specialization comparable to training the larger model from scratch. No direct diagnostic (pairwise expert cosine similarity, activation overlap, or gradient correlation after CPT) is reported to confirm that duplicated experts have in fact diverged; without it the observed loss match could be explained entirely by the warm-start benefit, undermining the claimed capacity-scaling efficiency.
- [§5.2] Table 1 / §5.2 (utility-based selection): the claim that gradient-based importance scores more than triple gap closure is load-bearing for the practical recipe. The paper should show that the selected experts are measurably more diverse post-CPT than uniform duplication (e.g., via the same diversity metric used in the decomposition), otherwise the improvement could be an artifact of the particular importance estimator rather than a general principle.
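The diagnostic both major comments ask for is cheap to compute. The sketch below (illustrative names, not the paper's metric) measures mean pairwise cosine similarity within each group of experts duplicated from the same source; values near 1 after CPT would indicate the symmetry never broke, while lower values indicate specialization.

```python
import numpy as np

def duplicate_divergence(expert_weights, m):
    """Mean pairwise cosine similarity within each group of m experts
    duplicated from the same source expert. Assumes the expert list
    keeps duplicates adjacent ([w0a, w0b, w1a, w1b, ...]). Near 1.0
    means the duplicates are still near-copies."""
    flat = [w.ravel() for w in expert_weights]
    sims = []
    for g in range(0, len(flat), m):           # one group of m duplicates
        group = flat[g:g + m]
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return float(np.mean(sims))
```

Two exact copies give similarity 1.0; fully orthogonal weights give 0.0, so the statistic directly tracks how far CPT has driven the duplicates apart.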
Minor comments (2)
- [§4.3] §4.3 (experimental details): data-exclusion rules, validation-set construction, and whether the baseline and upcycled runs used identical token budgets after the upcycling point are not stated explicitly; these details are needed to interpret the 32% GPU-hour figure.
- [Figures 3-5] Figures 3-5: axis labels and legend entries use inconsistent abbreviations for “upcycled” vs. “baseline”; a single consistent notation would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address each of the major comments below and have made revisions to the manuscript to incorporate additional supporting analyses where appropriate.
Point-by-point responses
- Referee: [§3 and §5.1] §3 (theoretical framework) and §5.1 (7B-13B results): the decomposition of the quality gap into additive capacity and initialization terms presupposes that CPT after duplication induces expert specialization comparable to training the larger model from scratch. No direct diagnostic (pairwise expert cosine similarity, activation overlap, or gradient correlation after CPT) is reported to confirm that duplicated experts have in fact diverged; without it the observed loss match could be explained entirely by the warm-start benefit, undermining the claimed capacity-scaling efficiency.
Authors: We appreciate this observation on the assumptions underlying our theoretical framework in §3. The decomposition separates the quality gap into initialization and capacity components based on the observed loss differences. While direct post-CPT diagnostics were not included in the original submission, the consistent matching of validation loss in our 7B-13B experiments, combined with ablations across scales and budgets, supports that specialization occurs during CPT. To provide explicit confirmation, we will add measurements of expert divergence (such as pairwise cosine similarities and activation overlap) after CPT to the revised manuscript. This will directly validate the capacity term and rule out the possibility that the benefits come solely from the warm start. Revision: yes.
- Referee: [§5.2] Table 1 / §5.2 (utility-based selection): the claim that gradient-based importance scores more than triple gap closure is load-bearing for the practical recipe. The paper should show that the selected experts are measurably more diverse post-CPT than uniform duplication (e.g., via the same diversity metric used in the decomposition), otherwise the improvement could be an artifact of the particular importance estimator rather than a general principle.
Authors: We agree that linking utility-based selection to measurable increases in post-CPT diversity would strengthen the practical utility of the method. The empirical results in §5.2 and Table 1 show that utility selection leads to substantially better gap closure under limited CPT. To demonstrate that this is due to greater diversity rather than to the specifics of the estimator, we will include in the revision a comparison of the diversity metric (from §3) for utility-selected vs. uniform duplication after CPT. This will confirm that the selected experts exhibit more specialization. Revision: yes.
Circularity Check
No significant circularity; the empirical results are independent of any definitional decomposition.
Full rationale
The paper's central claim is an empirical efficiency result: an upcycled 7B-to-13B MoE matches a fixed-size baseline on validation loss while using 32% fewer GPU hours. This is supported by direct experiments and ablations across scales, activation ratios, and budgets, not by a derivation that reduces, through the paper's own equations, to a fitted quantity or a self-citation chain. The theoretical framework simply decomposes the observed quality gap into named capacity and initialization terms; this naming neither forces the experimental outcome nor substitutes for the reported measurements. No load-bearing step equates a prediction with its own input by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · matched claim: "Theorem 3.1 (Expert upcycling bound) ... decomposes the quality gap into a capacity term and an initialization term"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relevance unclear · matched claim: "utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication"