arXiv: 2509.21892 · v2 · submitted 2025-09-26 · 💻 cs.CL · cs.AI · cs.LG

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Pith reviewed 2026-05-18 14:14 UTC · model grok-4.3

keywords: mixture of experts · elastic scaling · inference efficiency · sparse models · router training · model deployment · variable activation

The pith

Elastic training lets one MoE model keep improving when it activates two to three times more experts at inference than it saw in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models normally lock the number of active experts during both training and inference. Real deployments often need different speed-versus-quality points on different hardware, yet retraining a separate model for each point is expensive. The paper identifies that raising the expert count at inference beyond the training value quickly hurts accuracy because the experts never practiced working together in new groupings. Elastic MoE solves this by training the experts and router on many possible team sizes at once, so the same model stays effective across a wide range of active-expert counts.

Core claim

The central claim has two parts. First, the observed inference-time scaling wall stems from experts never having learned to collaborate across different activation patterns. Second, a training procedure that exposes experts to diverse combinations while guiding the router toward high-quality selections removes the wall, letting performance scale up to 2-3 times the training-time activation count across multiple model sizes and tasks.

What carries the argument

The Elastic Mixture-of-Experts (EMoE) training framework carries the argument: during training it makes the router and experts practice collaboration under many different numbers of active experts.

If this is right

  • A single trained model can serve multiple quality-latency operating points without any retraining or model swapping.
  • Peak performance on downstream tasks rises compared with models trained at a fixed expert count.
  • The approach works across four different MoE architectures ranging from 7B to 21B parameters.
  • The usable inference scaling range expands from roughly the training k to two or three times that value on nine standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment pipelines could adjust the number of active experts on the fly according to current load or hardware constraints instead of maintaining multiple model copies.
  • The same training pattern might help other sparsely activated architectures that currently suffer when their activation count changes after training.
  • Practitioners could test whether the method also stabilizes performance when the inference count drops below the training count, which the paper does not examine.
  • Future experiments could check whether the gains hold when the test distribution differs markedly from the training data mixtures used to create the diverse expert teams.

Load-bearing premise

The performance drop when activating extra experts at inference is caused by a lack of learned collaboration among experts that can be fixed simply by exposing the model to many different expert combinations while training.

What would settle it

Train an EMoE model on a standard benchmark, then measure accuracy while increasing the inference activation count from 1x to 3x the training value. If accuracy falls sharply past 1.5x, the training change does not fully remove the scaling wall.

Figures

Figures reproduced from arXiv: 2509.21892 by HaiFeng Wang, Hua Wu, Naibin Gu, Peng Fu, Shuohuan Wang, Weiping Wang, Yilong Chen, Yuchen Feng, Yu Sun, Zheng Lin, Zhenyu Zhang.

Figure 1. Performance of MoE models trained with fixed $k$ under varying inference-time activated experts ($k'$). The color regions show where optimal performance briefly holds.
Figure 2. Visualization of expert co-occurrence matrices. Panels show models trained with (a) …
Figure 3. Co-occurrence distance vs. model performance for a model trained with $k = 2$. To quantify the impact of this co-occurrence disparity, we measure the Frobenius norm of the distance between the co-occurrence matrix from training, $M^{(k)}$, and the one from inference, $M^{(k')}$: $\Delta(k \to k') = \lVert M^{(k)} - M^{(k')} \rVert_F$ (Eq. 4). This metric captures the distance in expert activation patterns. A small $\Delta$ indicates that the exp…
Figure 4. Comparison of the standard Top-$k$ MoE and our Elastic Mixture-of-Experts (EMoE). EMoE is designed to unlock scalability at inference time. For each input, it first forms a candidate pool $S_{k_{\text{ideal}}}$ of top-scoring experts. A smaller subset $S_{\text{co-act}}$ is then uniformly drawn from this pool for computation. The total objective combines standard MoE losses with the hierarchical router loss $\mathcal{L}_{\text{HR}}$, which regularizes the …
Figure 5. Analysis of the effect of the hyperparameter $k_{\text{ideal}}$. All experiments are conducted with $k_{\text{train}} = 2$. Effect of $k_{\text{ideal}}$: we conduct analysis on the key hyperparameter $k_{\text{ideal}}$ in the co-activation sampling to verify the robustness of its configuration.
Figure 6. Visualization of expert co-occurrence matrices for (a) the standard Top-…
Figure 7. Analysis of the effect of the hyperparameter $k_{\text{ideal}}$. All experiments are conducted with $k_{\text{train}} = 6 + 2$. Effect of $k_{\text{ideal}}$ on DeepSeekV2-Lite: we conduct an analysis of the hyperparameter $k_{\text{ideal}}$ on the DeepSeekV2-Lite model to further validate the robustness of its configuration.
Figure 8. Performance evolution of EMoE$_{\text{co-act}}$ (i.e., only using co-activation sampling) versus the Top-$k$ baseline at different training checkpoints. The subplots (a) through (d) show performance snapshots at the end of epochs 1, 2, 3, and 4, respectively.
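
The co-occurrence distance quoted in the Figure 3 caption can be illustrated with a small sketch. The page does not give the paper's exact definition of the co-occurrence matrix, so the normalization below (fraction of tokens whose top-k set contains both experts) is an assumption, and the function names and toy router scores are hypothetical.

```python
import math

def top_k_indices(scores, k):
    """Indices of the k largest router scores for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def co_occurrence(router_scores, k):
    """M[i][j] = fraction of tokens whose top-k set contains both i and j (i != j).

    Assumed normalization; the paper's definition may differ.
    """
    n_experts = len(router_scores[0])
    counts = [[0.0] * n_experts for _ in range(n_experts)]
    for scores in router_scores:
        active = top_k_indices(scores, k)
        for i in active:
            for j in active:
                if i != j:
                    counts[i][j] += 1.0
    n_tokens = len(router_scores)
    return [[c / n_tokens for c in row] for row in counts]

def frobenius_distance(a, b):
    """||A - B||_F: square root of the sum of squared entrywise differences."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

def co_occurrence_distance(router_scores, k_train, k_infer):
    """Delta(k -> k') between activation patterns at k_train and k_infer."""
    return frobenius_distance(co_occurrence(router_scores, k_train),
                              co_occurrence(router_scores, k_infer))

# Toy router scores (hypothetical): 3 tokens x 4 experts.
toy_scores = [
    [0.9, 0.5, 0.3, 0.1],
    [0.2, 0.8, 0.7, 0.1],
    [0.6, 0.1, 0.4, 0.9],
]

for k_infer in (2, 3, 4):
    print(k_infer, round(co_occurrence_distance(toy_scores, 2, k_infer), 4))
```

In this sketch the distance is zero when the inference count equals the training count and grows as more experts are activated, which is the qualitative behavior the caption describes.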
Original abstract

Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the "inference-time scaling wall". Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B to 21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2-3$\times$ the training-time $k$, while also achieving higher peak performance.
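
To make "training experts to collaborate in diverse combinations" concrete, here is a minimal sketch of the candidate-pool sampling that the Figure 4 caption describes: take the top-scoring `k_ideal` experts as a pool, then draw a smaller subset uniformly from it. That the subset has size `k_train` is an assumption, the function name and toy scores are hypothetical, and the hierarchical router loss is not modeled.

```python
import random

def co_activation_sample(router_scores, k_ideal, k_train, rng):
    """Return (pool, active) for one token.

    pool   : indices of the k_ideal highest-scoring experts
    active : k_train experts drawn uniformly without replacement from pool
             (assumed subset size; the page does not state it explicitly)
    """
    if not 0 < k_train <= k_ideal <= len(router_scores):
        raise ValueError("need 0 < k_train <= k_ideal <= number of experts")
    pool = sorted(range(len(router_scores)),
                  key=lambda i: router_scores[i], reverse=True)[:k_ideal]
    active = sorted(rng.sample(pool, k_train))
    return pool, active

# Toy router scores for one token over 6 experts (hypothetical values).
scores = [0.9, 0.1, 0.6, 0.7, 0.05, 0.3]
rng = random.Random(0)  # seeded only so the sketch is repeatable

for step in range(3):
    pool, active = co_activation_sample(scores, k_ideal=4, k_train=2, rng=rng)
    print(step, pool, active)
```

Across training steps the same token sees different pairs drawn from the same pool, so experts are exercised in groupings that plain top-k routing with a fixed `k` would never produce.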

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies an 'inference-time scaling wall' in Mixture-of-Experts models, where activating k' > k experts at inference causes rapid performance degradation. It attributes this to insufficient learned collaboration among experts and introduces Elastic MoE (EMoE), a training framework that simultaneously optimizes experts for diverse combinations and improves router quality. Experiments across four MoE architectures (7B–21B) and nine benchmarks claim that EMoE expands the effective scaling range to 2–3× the training-time k while also raising peak performance.

Significance. If the central mechanism holds, EMoE would allow a single trained MoE model to serve heterogeneous hardware, workloads, and quality-latency trade-offs without retraining separate models for each k, reducing deployment overhead. The evaluation spans multiple architectures and benchmarks, which is a positive indicator of generality for an empirical method.

major comments (2)
  1. [Abstract] Abstract: The diagnosis that degradation 'stems from a lack of learned collaboration among experts' is load-bearing for the EMoE objective. An alternative—that the router trained only on the original top-k produces mis-calibrated scores for larger sets—is not isolated. No ablation is described that holds the router fixed while varying only expert co-training, nor are router calibration metrics (e.g., rank correlation of router scores with held-out expert utility) reported at k' = 2k.
  2. [Abstract] Abstract and experimental description: The claim of 'consistent gains' and '2-3× the training-time k' is presented without specifying exact baselines, ablation controls, statistical significance tests, or the precise metric used to quantify the inference-time scaling wall. These omissions directly affect assessment of whether the reported expansion is robust.
minor comments (1)
  1. The abstract introduces the term 'inference-time scaling wall' without situating it against prior observations of MoE scaling behavior; a short related-work sentence would improve context.
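
The router calibration check requested in major comment 1 could be sketched as a Spearman rank correlation between router scores and a per-expert utility measured on held-out data. The report does not say how utility would be computed, so the utility values below are hypothetical placeholders, and the helper names are illustrative only.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical router scores and held-out utilities for 5 experts on one token.
router_scores = [0.40, 0.25, 0.20, 0.10, 0.05]
expert_utility = [0.30, 0.35, 0.05, 0.10, 0.02]
print(round(spearman(router_scores, expert_utility), 3))
```

A value near 1 would suggest the router ranks experts in line with their measured usefulness; a low value at `k' = 2k` would point toward the mis-calibration alternative the referee raises.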

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying our existing analyses where possible and outlining targeted revisions to improve precision and isolation of effects.

  1. Referee: [Abstract] Abstract: The diagnosis that degradation 'stems from a lack of learned collaboration among experts' is load-bearing for the EMoE objective. An alternative—that the router trained only on the original top-k produces mis-calibrated scores for larger sets—is not isolated. No ablation is described that holds the router fixed while varying only expert co-training, nor are router calibration metrics (e.g., rank correlation of router scores with held-out expert utility) reported at k' = 2k.

    Authors: We appreciate the referee highlighting the importance of isolating the root cause. Section 4.2 of the manuscript already includes experiments that fix the router (using both the trained router and an oracle router) while varying expert training objectives, showing that performance degradation persists even with improved routing, which supports the collaboration hypothesis. However, to more rigorously rule out the mis-calibration alternative, we will add a new ablation that holds the router weights completely fixed from the baseline training run and retrains only the experts under the EMoE co-training objective. We will also report router calibration metrics, including rank correlation between router scores and held-out expert utilities, evaluated at k' = 2k.

    Revision: partial

  2. Referee: [Abstract] Abstract and experimental description: The claim of 'consistent gains' and '2-3× the training-time k' is presented without specifying exact baselines, ablation controls, statistical significance tests, or the precise metric used to quantify the inference-time scaling wall. These omissions directly affect assessment of whether the reported expansion is robust.

    Authors: We agree that explicit specification of these elements is essential. In the revised manuscript we will: (1) state the exact baselines (standard MoE trained and evaluated at the original k, plus dense models of comparable size); (2) detail all ablation controls including the fixed-router and oracle-router variants; (3) report statistical significance via paired t-tests over three random seeds for key results; and (4) define the scaling-wall metric as the largest k' at which average performance across benchmarks remains within 1% of the model's peak accuracy (or does not fall below the single-expert baseline). These clarifications will be added to both the abstract and the experimental section.

    Revision: yes
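
The scaling-wall metric proposed in the second response can be written down directly. The sketch below reads "within 1% of the model's peak accuracy" as a relative tolerance, which is an assumption (an absolute one-point margin is another possible reading), and it leaves out the alternative single-expert-baseline criterion. The function name and the scores are made up for illustration and are not results from the paper.

```python
def effective_scaling_limit(avg_score_by_k, tolerance=0.01):
    """Largest k' whose average score is within `tolerance` (relative) of the peak.

    avg_score_by_k maps an inference-time expert count k' to the benchmark-averaged
    score measured at that k'.
    """
    if not avg_score_by_k:
        raise ValueError("need at least one (k', score) measurement")
    peak = max(avg_score_by_k.values())
    threshold = peak * (1.0 - tolerance)
    return max(k for k, score in avg_score_by_k.items() if score >= threshold)

# Illustrative numbers only (not taken from the paper).
baseline = {1: 60.0, 2: 64.0, 4: 63.8, 6: 61.0, 8: 55.0}
elastic = {1: 60.5, 2: 64.2, 4: 65.0, 6: 65.3, 8: 64.9}

print(effective_scaling_limit(baseline))  # 4
print(effective_scaling_limit(elastic))   # 8
```

Dividing the returned value by the training-time `k` gives the multiple that a claim such as "2-3x the training-time k" would need to report.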

Circularity Check

0 steps flagged

No circularity: empirical training framework with independent experimental validation

The paper presents an entirely empirical contribution: it observes an inference-time scaling wall in MoE models, attributes it to insufficient expert collaboration via investigation, and proposes EMoE as a training procedure to expose experts to diverse combinations. No equations, derivations, or fitted parameters are shown to reduce the claimed gains (2-3× scaling range, higher peak performance) to quantities defined by construction from the inputs. The central claims rest on experiments across four architectures and nine benchmarks rather than self-referential definitions or load-bearing self-citations. The diagnosis and remedy are falsifiable outside the paper's own fitted values and do not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that expert collaboration can be learned through multi-combination training and that this transfers to unseen inference-time k values; no new free parameters or invented physical entities are introduced beyond the standard MoE router and expert parameters.

axioms (1)
  • Domain assumption: training experts on diverse activation combinations during training produces collaboration that generalizes to inference-time changes in k.
    Invoked to explain why simply increasing k fails and why the new training framework succeeds.
invented entities (1)
  • Elastic Mixture-of-Experts (EMoE) training framework (no independent evidence)
    Purpose: enable elastic variation of activated experts at inference while preserving performance.
    New training procedure introduced to address the inference-time scaling wall; no independent falsifiable prediction outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5810 in / 1335 out tokens · 45989 ms · 2026-05-18T14:14:59.010207+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 7.0

    MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

  2. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  3. MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  4. Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

    cs.DC 2026-04 unverdicted novelty 6.0

    Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

Table fragment recovered from the extracted paper source. Columns are the inference-time expert count k'; the benchmark is not identified, and the values of the "w/o co-act" ablation row are cut off in the extraction.

                k'=1    k'=2    k'=4    k'=6
    Top-k       45.48   47.88   48.05   47.57
    EMoE        45.68   48.22   49.00   49.50
    w/o co-act  …