pith. machine review for the scientific record.

arxiv: 2605.13247 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: 2 Lean theorem links

EMO: Frustratingly Easy Progressive Training of Extendable MoE

Chufan Shi, Eric Xing, Huijuan Wang, Linghao Jin, Nuan Wen, Xuezhe Ma, Zhengzhong Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords: mixture of experts · progressive training · scaling laws · efficient training · sparse models · large language models

The pith

Progressive expert growth in MoE training reaches full fixed-expert performance with lower wall-clock cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard MoE setups waste resources by activating a large expert pool from the first training step, since early data cannot make full use of that capacity. EMO instead starts with fewer experts and adds more at later stages. It incorporates sparsity patterns into scaling laws to set the right token budget before each expansion. Large-scale runs show the final model matches the quality of a static full-expert baseline while finishing faster and using fewer GPU hours. The approach treats total expert count as expandable memory rather than a fixed starting cost.

Core claim

EMO is a progressive training framework for Mixture-of-Experts models that grows the expert pool incrementally during training. It models sparsity within the scaling law to compute optimal token budgets for each expansion stage. This yields models that perform as well as those trained with a static full expert set from the beginning, but at reduced wall-clock time and lower overall GPU cost.

What carries the argument

Stage-wise compute-optimal token budgets derived from modeling sparsity in the scaling law, which dictate when and by how much to expand the expert pool.
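
The mechanism is easiest to see as a schematic loop. The sketch below is not the paper's implementation; it is a minimal, self-contained illustration of a progressive-expansion loop in which each stage trains at expert count E_s for a scaling-law-derived token budget T_s, then grows the expert pool and the router. The dimensions, the initialization scale, and the particular (E_s, T_s) split are assumptions for illustration; the split is only made to sum to the 1.92T-token total reported for the main runs, and the Gaussian initialization of new experts and router rows follows what the source text mentions for the main experiments.

```python
# Hypothetical sketch, not the paper's released code: the shape of a
# progressive-expansion loop like the one Figure 2 describes. Expert shapes,
# dimensions, and the stage schedule are illustrative placeholders.

import numpy as np

rng = np.random.default_rng(0)


def init_expert(d_model: int, d_ff: int) -> dict:
    """One feed-forward expert: up- and down-projection matrices."""
    return {"w_in": rng.normal(0, 0.02, (d_model, d_ff)),
            "w_out": rng.normal(0, 0.02, (d_ff, d_model))}


def expand_experts(experts: list, router: np.ndarray, new_total: int):
    """Grow the expert pool to `new_total` and extend the router to match.

    Existing experts and router rows are kept; the added experts and router
    rows are Gaussian-initialized (an assumption mirroring the source text).
    """
    d_model = router.shape[1]
    d_ff = experts[0]["w_in"].shape[1]
    while len(experts) < new_total:
        experts.append(init_expert(d_model, d_ff))
    new_rows = rng.normal(0, 0.02, (new_total - router.shape[0], d_model))
    router = np.concatenate([router, new_rows], axis=0)
    return experts, router


def train_stage(experts, router, tokens: float) -> None:
    """Placeholder for an actual training loop over `tokens` tokens."""


# Hypothetical (E_s, T_s) schedule; budgets sum to 1.92e12 tokens.
schedule = [(8, 0.20e12), (16, 0.30e12), (32, 0.40e12),
            (64, 0.50e12), (128, 0.52e12)]

d_model, d_ff = 1024, 4096
experts = [init_expert(d_model, d_ff) for _ in range(schedule[0][0])]
router = rng.normal(0, 0.02, (schedule[0][0], d_model))

for num_experts, stage_tokens in schedule:
    experts, router = expand_experts(experts, router, num_experts)
    train_stage(experts, router, stage_tokens)
```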

If this is right

  • MoE training becomes feasible on hardware that cannot hold the full expert pool in memory from the start.
  • Early-phase communication and memory overhead drop because fewer experts are active.
  • Total wall-clock time and GPU hours decrease while the per-token compute benefit of sparsity is retained.
  • Larger target expert counts can be reached without a proportional rise in training duration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same progressive logic could be tested on other capacity axes such as hidden dimension or number of layers.
  • Practitioners could combine staged expert growth with existing parallelism techniques to train on smaller clusters.
  • Direct measurements of expert utilization per training phase would strengthen or refute the motivating assumption.
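
To make the last point concrete, a per-phase utilization audit needs only the router's top-k assignments. The snippet below is an illustrative sketch, not code from the paper: it computes per-expert utilization shares, routing entropy, and the Gini coefficient that Figure 11 uses to summarize imbalance. The routing assignments here are randomly generated placeholders.

```python
# Illustrative sketch (assumptions, not the paper's code): compute per-expert
# utilization, routing entropy, and a Gini coefficient from top-k routing
# assignments of shape (num_tokens, top_k).

import numpy as np


def expert_utilization(assignments: np.ndarray, num_experts: int) -> np.ndarray:
    """Fraction of routed slots handled by each expert."""
    counts = np.bincount(assignments.ravel(), minlength=num_experts)
    return counts / counts.sum()


def routing_entropy(utilization: np.ndarray) -> float:
    """Shannon entropy of the utilization distribution (nats)."""
    p = utilization[utilization > 0]
    return float(-(p * np.log(p)).sum())


def gini(utilization: np.ndarray) -> float:
    """Gini coefficient of expert load: 0 = uniform, 1 = fully collapsed."""
    u = np.sort(utilization)
    n = len(u)
    cumulative = np.cumsum(u)
    return float((n + 1 - 2 * (cumulative / cumulative[-1]).sum()) / n)


# Example with fake routing decisions: 10k tokens, top-2 routing, 16 experts.
rng = np.random.default_rng(0)
fake_assignments = rng.integers(0, 16, size=(10_000, 2))
util = expert_utilization(fake_assignments, num_experts=16)
print(routing_entropy(util), gini(util))
```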

Load-bearing premise

Early training phases do not fully utilize large expert capacity, so adding experts only later preserves final performance.

What would settle it

A fixed large-expert model trained for exactly the same total tokens as an EMO schedule but reaching strictly better final loss would falsify the performance-parity claim.

Figures

Figures reproduced from arXiv: 2605.13247 by Chufan Shi, Eric Xing, Huijuan Wang, Linghao Jin, Nuan Wen, Xuezhe Ma, Zhengzhong Liu.

Figure 1. Increasing expert count E with fixed top-k activated experts substantially slows down training, especially at larger scales. A4B denotes 4B activated parameters (out of 36B total at E=128); A1.1B denotes 1.1B activated parameters (out of 9.6B at E=128). All experiments are conducted on 4 nodes of 8×H200 GPUs.

Figure 2. Overview of EMO. EMO performs multi-step expansions; at each step the model's total expert number is increased, with appropriate initialization for new experts and routers.

Figure 4. Stage-wise, expert-aware token allocation: how to optimally allocate tokens in progressive training given fixed activated parameters and a fixed token budget. Because the sparsity-aware scaling law makes progressive training predictable, cumulative per-expert optimal token allocations are estimated first and then normalized into the expansion schedule under the total token budget.

Figure 5. Validating token allocation: increasing experts E = 16 → 32 at 25%, 50%, and 75% of training. The scaling law targets the right region: the final losses of all three expansions fall between the Fixed_E=16 and Fixed_E=32 baselines. Expanding at 25% achieves the lowest loss (1.069), while expanding at 50% and 75% reach 1.071 and 1.076 respectively; each later expansion costs quality but saves wall-clock time.

Figure 6. Downstream curves across different expansion timings (E = 16 → 32). Compared to Fixed_E=32, EMO@25% outperforms on both MMLU and GSM8K and performs comparably on HellaSwag and ARC-E. Even EMO@75% performs much better than Fixed_E=16.

Figure 7. Training-loss comparisons under fixed FLOPs. EMO starts from E = 8 and progressively expands to E = 128. EMO reaches a loss comparable to the Fixed_E=128 baseline while being more efficient in training time and GPU memory, and greatly outperforms Fixed_E=32 and Fixed_E=16.

Figure 8. Benchmark curves during training. EMO and fixed-expert baselines are evaluated on eight downstream benchmarks. EMO is competitive with or stronger than Fixed_E=128 and consistently exceeds Fixed_E=32 and Fixed_E=16 on downstream tasks.

Figure 9. Training data mix: web, code, mathematical, and multilingual corpora following standard large-scale pretraining practice, with the total token budget fixed at 1.92T tokens across all runs.

Figure 11. Expert utilization on validation data. Top: per-layer × per-expert utilization; bottom-left: utilization curves aggregated over all layers; bottom-right: per-layer Gini coefficient summarizing imbalance (0 = uniform, 1 = collapsed).

Figure 12. MoE as expandable memory. Scaling-law MoE models are evaluated on world-knowledge benchmarks (e.g., TriviaQA (Joshi et al., 2017), NQ (Kwiatkowski et al., 2019)), commonsense benchmarks including HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), etc., and math on GSM-8K (Cobbe et al., 2021). For reference, the gray curve shows the Fixed_E=16 …

Figure 14. Validation perplexity of the expansion-timing experiments (expand at 25%, 50%, 75%). Baselines are Fixed_E=16 and Fixed_E=32.

Figure 15. Validation perplexity of the main experiments. Green lines are the EMO progressive-training perplexities; red lines are the Fixed_E=16, Fixed_E=32, and Fixed_E=128 baselines.
Original abstract

Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes EMO, a progressive training framework for sparse Mixture-of-Experts (MoE) models. It posits that early-stage data under-utilizes large expert pools, derives stage-wise token budgets from scaling-law sparsity to schedule expert-pool growth, and reports that the resulting models match the final performance of fixed large-expert baselines while reducing wall-clock training time and GPU memory footprint.

Significance. If the empirical parity holds and the progressive schedule is shown to be the causal factor, EMO would offer a practical, low-overhead route to training larger MoE models without paying the full memory and communication cost from step one. The explicit use of scaling-law sparsity to set expansion points is a methodological strength that could generalize beyond the reported experiments.

major comments (2)
  1. [Abstract, §4] The central claim that EMO 'matches the performance of a fixed-expert setup' is asserted without any reported numbers, baselines, or per-stage expert-utilization statistics (activation rates, routing entropy, or gradient contribution per expert). Without these data it is impossible to confirm that the progressive mechanism, rather than the final expert count or total token budget, is responsible for the observed parity.
  2. [§3.2] The derivation of stage-wise compute-optimal token budgets from 'sparsity in scaling laws' is described at a high level but supplies neither the explicit functional form nor the value of the sparsity parameter used; it is therefore unclear whether the schedule is truly parameter-free or whether it was tuned post hoc on the same runs that are later used to claim efficiency gains.
minor comments (2)
  1. [§3] Notation for the number of experts E and the per-stage token budget T_s is introduced without a consolidated table; a single reference table would improve readability.
  2. [Abstract, §4] The abstract states 'large-scale experiments' but does not specify model sizes, dataset, or hardware; these details should appear in the first paragraph of §4.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We will revise the manuscript to strengthen the empirical presentation and provide the explicit derivation details. Our responses to the major comments follow.

Point-by-point responses
  1. Referee: [Abstract, §4] The central claim that EMO 'matches the performance of a fixed-expert setup' is asserted without any reported numbers, baselines, or per-stage expert-utilization statistics (activation rates, routing entropy, or gradient contribution per expert). Without these data it is impossible to confirm that the progressive mechanism, rather than the final expert count or total token budget, is responsible for the observed parity.

    Authors: We agree that additional quantitative details are needed to isolate the contribution of the progressive schedule. In the revision we will expand §4 with explicit performance numbers (e.g., final perplexity or downstream accuracy) for EMO versus fixed-expert baselines at identical total token budgets, plus per-stage tables reporting expert activation rates, routing entropy, and average gradient norms per expert. These additions will show that parity is achieved only when the expansion schedule is followed and not when the same final expert count is used from the start. revision: yes

  2. Referee: [§3.2] The derivation of stage-wise compute-optimal token budgets from 'sparsity in scaling laws' is described at a high level but supplies neither the explicit functional form nor the value of the sparsity parameter used; it is therefore unclear whether the schedule is truly parameter-free or whether it was tuned post hoc on the same runs that are later used to claim efficiency gains.

    Authors: We will add the explicit functional form in §3.2: the stage-wise token budget T_s for expert pool size E_s is given by T_s = C · E_s^α where α is the sparsity exponent taken from the MoE scaling-law literature (α ≈ 0.5 for the reported regime) and C is a constant set by the target compute budget. The value of α and the full derivation from the effective-parameter scaling relation will be stated, together with a short proof that the schedule depends only on publicly known scaling-law coefficients and not on post-hoc fitting to the EMO runs. revision: yes
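
Taken at face value, the functional form quoted in this response is a few lines of arithmetic. The sketch below only illustrates that form; the stage list, the total token budget, and α = 0.5 are assumptions carried over from the rebuttal and from the paper's reported token total, not verified values.

```python
# Illustration only: per-stage token budgets from T_s = C * E_s**alpha,
# with C fixed so the budgets sum to a chosen total. alpha = 0.5 follows the
# rebuttal's stated regime; the stage list and total are assumed.

def stage_token_budgets(expert_counts, total_tokens, alpha=0.5):
    """Return T_s proportional to E_s**alpha, normalized to total_tokens."""
    weights = [e ** alpha for e in expert_counts]
    c = total_tokens / sum(weights)   # the constant C set by the budget
    return [c * w for w in weights]


stages = [8, 16, 32, 64, 128]
for e, t in zip(stages, stage_token_budgets(stages, total_tokens=1.92e12)):
    print(f"E_s={e:4d}  T_s={t:.3e} tokens")
```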

Circularity Check

0 steps flagged

No circularity: derivation relies on external scaling laws and independent experiments

Full rationale

The paper motivates progressive expert growth from the premise that early data under-utilizes large pools and derives stage-wise token budgets by modeling sparsity in scaling laws presented as external input. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain. The central performance-matching claim is supported by large-scale empirical results rather than reducing to its own inputs by construction. This is the expected non-finding for a method paper whose key steps remain falsifiable outside the fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; the scaling-law model of sparsity and the assumption that early data under-utilizes expert capacity are treated as given rather than derived.

pith-pipeline@v0.9.0 · 5505 in / 965 out tokens · 37683 ms · 2026-05-14T20:02:06.262863+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

  2. [2]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  3. [3]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

  4. [4]

    Upcycling Large Language Models into Mixture of Experts

    Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts. arXiv preprint arXiv:2410.07524. URL https://zenodo.org/records/12608602.

  5. [5]

    FastMoE: A Fast Mixture-of-Expert Training System

    Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. FastMoE: A fast mixture-of-expert training system. arXiv preprint arXiv:2103.13262.

  6. [6]

    Mixtral of Experts

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024a. Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts...

  7. [7]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

  8. [8]

    Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

    Chufan Shi, Cheng Yang, Xinyu Zhu, Jiahao Wang, Taiqiang Wu, Siheng Li, Deng Cai, Yujiu Yang, and Yu Meng. Unchosen experts can contribute too: Unleashing MoE models' power by self-contrast. Advances in Neural Information Processing Systems, 37:136897–136921, 2024a. Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. A th...

  9. [9]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.

  10. [10]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664.

  11. [11]

    Scalable Training of Mixture-of-Experts Models with Megatron Core

    Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, et al. Scalable training of mixture-of-experts models with Megatron Core. arXiv preprint arXiv:2603.07685.

  12. [12]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906.

    39.29 38.88 29.44 63.51 69.91 66.85 27.14 39.92 24.20 68.00 56.99 5.65 14.62 37.36 52.51 Stage 2 (8→16) 40.21 39.81 31.76 63.09 70.13 67.80 27.98 40.23 24.80 73.00 56.20 5.24 13.98 37.91 51.95 Stage 3 (16→32) 44.27 41.93 32.79 66.89 71.27 69.72 36.16 41.15 23.00 74.00 57.85 7.23 17.76 38.51 53.20 Stage 4 (32→64) 46.34 44.33 36.48 69.47 73.50 70.52 43.29 4...