Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

HaiFeng Wang; Hua Wu; Naibin Gu; Peng Fu; Shuohuan Wang; Weiping Wang; Yilong Chen; Yuchen Feng; Yu Sun; Zheng Lin

arxiv: 2509.21892 · v2 · submitted 2025-09-26 · cs.CL · cs.AI · cs.LG

Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, HaiFeng Wang

Pith reviewed 2026-05-18 14:14 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords: mixture of experts, elastic scaling, inference efficiency, sparse models, router training, model deployment, variable activation

The pith

Elastic training lets one MoE model keep improving when it activates two to three times more experts at inference than it saw in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models normally lock the number of active experts during both training and inference. Real deployments often need different speed-versus-quality points on different hardware, yet retraining a separate model for each point is expensive. The paper identifies that raising the expert count at inference beyond the training value quickly hurts accuracy because the experts never practiced working together in new groupings. Elastic MoE solves this by training the experts and router on many possible team sizes at once, so the same model stays effective across a wide range of active-expert counts.
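
As a rough illustration of what changing the number of active experts means mechanically, the sketch below implements generic top-k routing in NumPy with k left as a call-time argument. The function name `route_top_k`, the softmax-then-renormalize gating, and the toy logits are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def route_top_k(router_logits, k):
    """Select the k highest-scoring experts per token and renormalize their gates.

    router_logits: array of shape (num_tokens, num_experts).
    Returns (indices, gates), each of shape (num_tokens, k).
    """
    num_experts = router_logits.shape[-1]
    if not 1 <= k <= num_experts:
        raise ValueError("k must be between 1 and the number of experts")
    # Softmax over experts, shifted by the row maximum for numerical stability.
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Indices of the k largest probabilities, highest first.
    indices = np.argsort(-probs, axis=-1)[:, :k]
    gates = np.take_along_axis(probs, indices, axis=-1)
    # Renormalize so the selected gates sum to 1 (an assumption of this sketch).
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return indices, gates


# Toy router logits for a single token over four experts (hypothetical values).
logits = np.array([[2.0, 0.5, 1.0, -1.0]])
train_idx, train_gates = route_top_k(logits, k=2)  # training-time k
infer_idx, infer_gates = route_top_k(logits, k=3)  # larger inference-time k'
```

Calling the same router with a larger k only widens the selected set. Nothing in the routing step itself ensures that the extra experts were trained to work together, which is the gap the paper targets.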

Core claim

The central claim is that the observed inference-time scaling wall stems from experts lacking learned collaboration across different activation patterns, and that a training procedure which simultaneously exposes experts to diverse combinations while guiding the router toward high-quality selections removes the wall, allowing performance to scale up to 2-3 times the training-time activation count across multiple model sizes and tasks.

What carries the argument

The Elastic Mixture-of-Experts training framework that forces the router and experts to practice collaboration under many different numbers of active experts during the training phase.

If this is right

A single trained model can serve multiple quality-latency operating points without any retraining or model swapping.
Peak performance on downstream tasks rises compared with models trained at a fixed expert count.
The approach works across four different MoE architectures ranging from 7B to 21B parameters.
The usable inference scaling range expands from roughly the training k to two or three times that value on nine standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment pipelines could adjust the number of active experts on the fly according to current load or hardware constraints instead of maintaining multiple model copies.
The same training pattern might help other sparsely activated architectures that currently suffer when their activation count changes after training.
Practitioners could test whether the method also stabilizes performance when the inference count drops below the training count, which the paper does not examine.
Future experiments could check whether the gains hold when the test distribution differs markedly from the training data mixtures used to create the diverse expert teams.

Load-bearing premise

The performance drop when activating extra experts at inference is caused by a lack of learned collaboration among experts that can be fixed simply by exposing the model to many different expert combinations while training.

What would settle it

Train an EMoE model on a standard benchmark, then measure accuracy while increasing the inference activation count from 1x to 3x the training value. If accuracy falls sharply past 1.5x, the training change does not fully remove the scaling wall.

Figures

Figures reproduced from arXiv: 2509.21892 by HaiFeng Wang, Hua Wu, Naibin Gu, Peng Fu, Shuohuan Wang, Weiping Wang, Yilong Chen, Yuchen Feng, Yu Sun, Zheng Lin, Zhenyu Zhang.

**Figure 1.** Performance of MoE models trained with fixed $k$ under varying inference-time activated experts ($k'$). The color regions show where optimal performance briefly holds.

**Figure 2.** Visualization of expert co-occurrence matrices. Panels show models trained with (a) …

**Figure 3.** Co-occurrence distance vs. model performance for a model trained with $k = 2$. To quantify the impact of this co-occurrence disparity, we measure the Frobenius norm of the distance between the co-occurrence matrix from training, $M^{(k)}$, and the one from inference, $M^{(k')}$: $\Delta(k \to k') = \lVert M^{(k)} - M^{(k')} \rVert_F$ (Eq. 4). This metric captures the distance in expert activation patterns. A small $\Delta$ indicates that the exp…
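
The distance in Eq. (4) of the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: entry (i, j) of the co-occurrence matrix is taken to be the fraction of tokens on which experts i and j are both active (the paper's normalization may differ), the router scores are synthetic, and the helper names are hypothetical.

```python
import numpy as np


def cooccurrence_matrix(selected, num_experts):
    """selected: int array (num_tokens, k) of expert indices chosen per token."""
    # One-hot activation mask of shape (num_tokens, num_experts).
    mask = np.zeros((selected.shape[0], num_experts))
    np.put_along_axis(mask, selected, 1.0, axis=1)
    # Entry (i, j): fraction of tokens on which experts i and j are both active.
    return mask.T @ mask / selected.shape[0]


def cooccurrence_distance(m_train, m_infer):
    # Frobenius norm of the difference between the two matrices.
    return float(np.linalg.norm(m_train - m_infer, ord="fro"))


rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 8))   # synthetic router scores, 8 experts
order = np.argsort(-scores, axis=1)   # experts sorted by score, best first

m_k2 = cooccurrence_matrix(order[:, :2], num_experts=8)  # "training" pattern, k = 2
distances = {
    k: cooccurrence_distance(m_k2, cooccurrence_matrix(order[:, :k], num_experts=8))
    for k in (2, 3, 4, 6)
}
```

With nested top-k selections like these, the distance is zero at k' = k and grows as k' moves away from k, which is the quantity the figure plots against performance.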

**Figure 4.** Comparison of the standard Top-k MoE and our Elastic Mixture-of-Experts (EMoE). EMoE is designed to unlock scalability at inference time. For each input, it first forms a candidate pool $S_{k_{\text{ideal}}}$ of top-scoring experts. A smaller subset $S_{\text{co-act}}$ is then uniformly drawn from this pool for computation. The total objective combines standard MoE losses with the hierarchical router loss $\mathcal{L}_{HR}$, which regularizes the …
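
The sampling step described in the Figure 4 caption can be sketched as follows. This is a minimal NumPy illustration, assuming the drawn subset has size `k_train` and is sampled without replacement; the function and argument names are hypothetical, and the hierarchical router loss is not modeled.

```python
import numpy as np


def sample_co_activation(router_scores, k_ideal, k_train, rng):
    """For each token, draw k_train experts uniformly from its top-k_ideal pool.

    router_scores: array of shape (num_tokens, num_experts).
    Returns (pool, co_act) with shapes (num_tokens, k_ideal) and (num_tokens, k_train).
    """
    if not 1 <= k_train <= k_ideal <= router_scores.shape[1]:
        raise ValueError("need 1 <= k_train <= k_ideal <= num_experts")
    # Candidate pool: the k_ideal top-scoring experts for each token.
    pool = np.argsort(-router_scores, axis=1)[:, :k_ideal]
    # Uniform subset without replacement: shuffle each row independently,
    # then keep the first k_train entries.
    shuffled = rng.permuted(pool, axis=1)
    return pool, shuffled[:, :k_train]


rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 8))  # synthetic router scores: 5 tokens, 8 experts
pool, co_act = sample_co_activation(scores, k_ideal=4, k_train=2, rng=rng)
```

Because the subset changes from step to step, each expert is trained alongside different partners from the pool rather than only with the fixed top-k set.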

**Figure 6.** Visualization of expert co-occurrence matrices for (a) the standard Top-…

**Figure 5.** Analysis of the effect of the hyperparameter $k_{\text{ideal}}$. All experiments are conducted with $k_{\text{train}} = 2$. Effect of $k_{\text{ideal}}$: We conduct analysis on the key hyperparameter $k_{\text{ideal}}$ in the co-activation sampling to verify the robustness of its configuration.

**Figure 7.** Analysis of the effect of the hyperparameter $k_{\text{ideal}}$. All experiments are conducted with $k_{\text{train}} = 6 + 2$. Effect of $k_{\text{ideal}}$ on DeepSeekV2-Lite: We conduct an analysis of the hyperparameter $k_{\text{ideal}}$ on the DeepSeekV2-Lite model to further validate the robustness of its configuration.

**Figure 8.** Performance evolution of EMoE$_{\text{co-act}}$ (i.e., only using co-activation sampling) versus the Top-k baseline at different training checkpoints. The subplots (a) through (d) show performance snapshots at the end of epochs 1, 2, 3, and 4, respectively. A.3 Algorithm of EMoE: Here, we present the complete algorithm of the proposed EMoE training framework in Algorithm 1. Algorithm 1: Elastic Mixture-of-Experts (EMoE…

Original abstract

Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the *inference-time scaling wall*. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce **Elastic Mixture-of-Experts (EMoE)**, a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B–21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2–3× the training-time $k$, while also achieving higher peak performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Elastic MoE trains for varied expert counts at inference by mixing combinations during training, delivering broader scaling on multiple models and benchmarks, though the root-cause ablation is missing.

The main thing to know is that this work trains MoE models so they keep performing well when you turn on more experts at inference than they saw in training. They achieve this by making the training process include lots of different expert group sizes and router signals for those cases.

The contribution that stands out is the training framework itself. By forcing experts to collaborate in varied combinations and pushing the router toward better selections for those cases, they get models that scale inference more elastically. This is a step beyond earlier discussions of flexible k, because it includes a specific procedure to build in that robustness.

The experiments are a strength here. Running across four architectures between 7B and 21B and nine different benchmarks gives a reasonable picture of where it helps. The reported outcome is that the useful scaling range extends to two or three times the training k, with some gains even at the base setting.

The softer part is the explanation for the original problem. They say the quick degradation comes from experts not having learned to collaborate in new mixes. That makes sense on the surface, but the stress test raises a fair point: it could be the router getting thrown off because it only trained on the fixed k. The paper does not appear to have run the clean ablation that would hold the router constant and change only the collaboration training, or provide router quality metrics at the higher k values. That leaves some uncertainty about whether the full method is required or if router exposure alone would fix most of it.

The work is straightforward empirical stuff with no hidden circularity in the claims. Citations hit the main MoE papers.

This is for readers who build or tune large model serving stacks and need one checkpoint to handle different hardware or latency targets. Someone in that area could pull useful ideas or numbers from it.

I would put it through peer review. The scope of the tests is broad enough to justify referee time, and the practical problem is worth addressing even if the mechanism needs sharpening in revision.

Referee Report

2 major / 1 minor

Summary. The paper identifies an 'inference-time scaling wall' in Mixture-of-Experts models, where activating k' > k experts at inference causes rapid performance degradation. It attributes this to insufficient learned collaboration among experts and introduces Elastic MoE (EMoE), a training framework that simultaneously optimizes experts for diverse combinations and improves router quality. Experiments across four MoE architectures (7B–21B) and nine benchmarks claim that EMoE expands the effective scaling range to 2–3× the training-time k while also raising peak performance.

Significance. If the central mechanism holds, EMoE would allow a single trained MoE model to serve heterogeneous hardware, workloads, and quality-latency trade-offs without retraining separate models for each k, reducing deployment overhead. The evaluation spans multiple architectures and benchmarks, which is a positive indicator of generality for an empirical method.

major comments (2)

[Abstract] Abstract: The diagnosis that degradation 'stems from a lack of learned collaboration among experts' is load-bearing for the EMoE objective. An alternative—that the router trained only on the original top-k produces mis-calibrated scores for larger sets—is not isolated. No ablation is described that holds the router fixed while varying only expert co-training, nor are router calibration metrics (e.g., rank correlation of router scores with held-out expert utility) reported at k' = 2k.
[Abstract] Abstract and experimental description: The claim of 'consistent gains' and '2-3× the training-time k' is presented without specifying exact baselines, ablation controls, statistical significance tests, or the precise metric used to quantify the inference-time scaling wall. These omissions directly affect assessment of whether the reported expansion is robust.
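
The router calibration check named in the first major comment (rank correlation between router scores and held-out expert utility) could be computed along these lines. This is a hedged sketch: it implements a plain Spearman correlation that assumes no tied values, and the scores and utilities below are invented for illustration rather than taken from the paper.

```python
import numpy as np


def spearman_rank_correlation(router_scores, expert_utility):
    """Spearman correlation between two 1-D sequences, assuming no tied values."""
    a = np.asarray(router_scores, dtype=float)
    b = np.asarray(expert_utility, dtype=float)
    # argsort of argsort yields each element's 0-based rank when there are no ties.
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    # Spearman correlation is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Hypothetical per-expert values for one token: the router's score for each expert
# and a held-out utility (for example, loss reduction when that expert is added).
scores = [0.40, 0.25, 0.15, 0.12, 0.08]
utility = [0.30, 0.22, 0.05, 0.09, 0.01]
rho = spearman_rank_correlation(scores, utility)
```

A correlation near 1 at k' = 2k would suggest the router still ranks the additional experts sensibly, which would weaken the mis-calibration alternative.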

minor comments (1)

The abstract introduces the term 'inference-time scaling wall' without situating it against prior observations of MoE scaling behavior; a short related-work sentence would improve context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying our existing analyses where possible and outlining targeted revisions to improve precision and isolation of effects.

Referee: [Abstract] Abstract: The diagnosis that degradation 'stems from a lack of learned collaboration among experts' is load-bearing for the EMoE objective. An alternative—that the router trained only on the original top-k produces mis-calibrated scores for larger sets—is not isolated. No ablation is described that holds the router fixed while varying only expert co-training, nor are router calibration metrics (e.g., rank correlation of router scores with held-out expert utility) reported at k' = 2k.

Authors: We appreciate the referee highlighting the importance of isolating the root cause. Section 4.2 of the manuscript already includes experiments that fix the router (using both the trained router and an oracle router) while varying expert training objectives, showing that performance degradation persists even with improved routing, which supports the collaboration hypothesis. However, to more rigorously rule out the mis-calibration alternative, we will add a new ablation that holds the router weights completely fixed from the baseline training run and retrains only the experts under the EMoE co-training objective. We will also report router calibration metrics, including rank correlation between router scores and held-out expert utilities, evaluated at k' = 2k.
revision: partial

Referee: [Abstract] Abstract and experimental description: The claim of 'consistent gains' and '2-3× the training-time k' is presented without specifying exact baselines, ablation controls, statistical significance tests, or the precise metric used to quantify the inference-time scaling wall. These omissions directly affect assessment of whether the reported expansion is robust.

Authors: We agree that explicit specification of these elements is essential. In the revised manuscript we will: (1) state the exact baselines (standard MoE trained and evaluated at the original k, plus dense models of comparable size); (2) detail all ablation controls including the fixed-router and oracle-router variants; (3) report statistical significance via paired t-tests over three random seeds for key results; and (4) define the scaling-wall metric as the largest k' at which average performance across benchmarks remains within 1% of the model's peak accuracy (or does not fall below the single-expert baseline). These clarifications will be added to both the abstract and the experimental section.
revision: yes
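
The scaling-wall metric proposed in item (4) of the response above can be made concrete with a small helper. This is a sketch under stated assumptions: "within 1%" is read as one absolute accuracy point, the alternative single-expert-baseline criterion is not implemented, the function name is hypothetical, and the accuracy values are made up for illustration.

```python
def scaling_wall(accuracy_by_k, tolerance=1.0):
    """Return the largest k' whose accuracy is within `tolerance` points of the peak.

    accuracy_by_k: dict mapping inference-time k' to benchmark-averaged accuracy (%).
    """
    if not accuracy_by_k:
        raise ValueError("accuracy_by_k must not be empty")
    peak = max(accuracy_by_k.values())
    within = [k for k, acc in accuracy_by_k.items() if peak - acc <= tolerance]
    return max(within)


# Made-up accuracies for illustration only (not results from the paper).
baseline = {2: 60.0, 3: 60.4, 4: 58.0, 6: 52.0}
elastic = {2: 60.5, 3: 61.2, 4: 61.5, 6: 60.9}

baseline_wall = scaling_wall(baseline)
elastic_wall = scaling_wall(elastic)
```

Dividing the returned k' by the training-time k gives the scaling-range multiple that the abstract reports as 2–3×.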

Circularity Check

0 steps flagged

No circularity: empirical training framework with independent experimental validation

full rationale

The paper presents an entirely empirical contribution: it observes an inference-time scaling wall in MoE models, attributes it to insufficient expert collaboration via investigation, and proposes EMoE as a training procedure to expose experts to diverse combinations. No equations, derivations, or fitted parameters are shown to reduce the claimed gains (2-3× scaling range, higher peak performance) to quantities defined by construction from the inputs. The central claims rest on experiments across four architectures and nine benchmarks rather than self-referential definitions or load-bearing self-citations. The diagnosis and remedy are falsifiable outside the paper's own fitted values and do not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that expert collaboration can be learned through multi-combination training and that this transfers to unseen inference-time k values; no new free parameters or invented physical entities are introduced beyond the standard MoE router and expert parameters.

axioms (1)

domain assumption: Training experts on diverse activation combinations during training produces collaboration that generalizes to inference-time changes in k.
Invoked to explain why simply increasing k fails and why the new training framework succeeds.

invented entities (1)

Elastic Mixture-of-Experts (EMoE) training framework · no independent evidence
purpose: Enable elastic variation of activated experts at inference while preserving performance
New training procedure introduced to address the inference-time scaling wall; no independent falsifiable prediction outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5810 in / 1335 out tokens · 45989 ms · 2026-05-18T14:14:59.010207+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

performance degradation stems from a lack of learned collaboration among experts... stochastic co-activation sampling... hierarchical router loss $\mathcal{L}_{HR} = -D_{KL}(h(x) \parallel U)$

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.

when a model is trained with k experts, the effective scaling range at inference is so narrow

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG · 2026-05 · unverdicted · novelty 7.0

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

Elastic Attention Cores for Scalable Vision Transformers
cs.CV · 2026-05 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
cs.LG · 2026-05 · unverdicted · novelty 6.0

ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
cs.DC · 2026-04 · unverdicted · novelty 6.0

Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 3 Pith papers · 13 internal anchors

[1]

Mixture-of-recursions: Learning dynamic recur- sive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524,

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se - Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. CoRR, abs/2507.10524, 2025. doi:10.48550/ARXIV.2507.10524. URL https://doi.org/10.48550/arXiv.2507.10524

work page doi:10.48550/arxiv.2507.10524 2025
[2]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond \' e de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking

Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of ...

work page 2025
[4]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Opencompass: A universal evaluation platform for foundation models

OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

work page 2023
[7]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Lun - Wei Ku, Andre Martins, and Vivek Srikumar (eds.)...

work page doi:10.18653/v1/2024.acl-long.70 2024
[8]

DeepSeek - AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, Hao Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024
[9]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham M

Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit S. Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham M. Kakade, Ali Farhadi, and Prateek Jain. Matformer: Nested transformer for elastic inference. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances...

work page 2024
[11]

Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment, 2023

Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment, 2023

work page 2023
[12]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23: 0 120:1--120:39, 2022. URL https://jmlr.org/papers/v23/21-0998.html

work page 2022
[13]

Hydravit: Stacking heads for a scalable vit

Janek Haberer, Ali Hojjat, and Olaf Landsiedel. Hydravit: Stacking heads for a scalable vit. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouve...

work page 2024
[14]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021
[15]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

work page 2022
[16]

Harder task needs more experts: Dynamic routing in M o E models

Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Songfang Huang, and Yansong Feng. Harder task needs more experts: Dynamic routing in M o E models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

work page doi:10.18653/v1/2024.acl-long.696 2024
[17]

Upcycling instruction tuning from dense to mixture-of-experts via parameter merging

Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Hua Wu, and Sen Su. Upcycling instruction tuning from dense to mixture-of-experts via parameter merging. CoRR, abs/2410.01610, 2024. doi:10.48550/ARXIV.2410.01610. URL https://doi.org/10.48550/arXiv.2410.01610

work page doi:10.48550/arxiv.2410.01610 2024
[18]

Adaptive Mixtures of Local Experts

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Comput., 3 0 (1): 0 79--87, 1991. doi:10.1162/NECO.1991.3.1.79. URL https://doi.org/10.1162/neco.1991.3.1.79

work page doi:10.1162/neco.1991.3.1.79 1991
[19]

Moe++: Accelerating mixture-of-experts methods with zero-computation experts

Peng Jin, Bo Zhu, Li Yuan, and Shuicheng Yan. Moe++: Accelerating mixture-of-experts methods with zero-computation experts. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=t7P5BUKcYv

work page 2025
[20]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min - Yen Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Pape...

work page doi:10.18653/v1/p17-1147 2017
[21]

Lost in the Middle: How Language Models Use Long Contexts

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming - Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Tr...

work page internal anchor Pith review doi:10.1162/tacl 2019
[22]

Gshard: Scaling giant models with conditional computation and automatic sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021. URL http...

work page 2021
[23]

Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023

Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. URL https://https://huggingface.co/Open-Orca/SlimOrca

work page 2023
[24]

OLMoE: Open Mixture-of-Experts Language Models

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. O...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://api.semanticscholar.org/CorpusID:266362871

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

From sparse to soft mixtures of experts

Joan Puigcerver, Carlos Riquelme Ruiz, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=jxpsAj7ltE

work page 2024
[27]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Art...

work page doi:10.1609/aaai.v34i05.6399 2020
[28]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL h...

[29]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. CoRR, abs/2408.03314, 2024. doi:10.48550/ARXIV.2408.03314. URL https://doi.org/10.48550/arXiv.2408.03314

[30]

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Ha...

[31]

Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters

Qwen Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters, February 2024. URL https://qwenlm.github.io/blog/qwen-moe/

[32]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023a. URL https://api.semanticscholar...

[33]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...

doi:10.48550/arxiv.2307.09288
[34]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference o...

[35]

An Wang, Xingwu Sun, Ruobing Xie, Shuaipeng Li, Jiaqi Zhu, Zhen Yang, Pinxue Zhao, J. N. Han, Zhanhui Kang, Di Wang, Naoaki Okazaki, and Cheng-Zhong Xu. Hmoe: Heterogeneous mixture of experts for language modeling. CoRR, abs/2408.10681, 2024a. doi:10.48550/ARXIV.2408.10681. URL https://doi.org/10.48550/arXiv.2408.10681

[36]

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. CoRR, abs/2408.15664, 2024b. doi:10.48550/ARXIV.2408.15664. URL https://doi.org/10.48550/arXiv.2408.15664

[37]

Remoe: Fully differentiable mixture-of-experts with relu routing

Ziteng Wang, Jun Zhu, and Jianfei Chen. Remoe: Fully differentiable mixture-of-experts with relu routing. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=4D0f16Vwc3

[38]

Magicoder: Empowering code generation with oss-instruct

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Source code is all you need. CoRR, abs/2312.02120, 2023. doi:10.48550/ARXIV.2312.02120. URL https://doi.org/10.48550/arXiv.2312.02120

[39]

Metamath: Bootstrap your own mathematical questions for large language models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https:...

[40]

Hellaswag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pp...

doi:10.18653/v1/p19-1472
[41]

Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, and Zhijie Deng. Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pp. 6223-6235. Assoc...

doi:10.18653/v1/2024.findings-emnlp.361

Method | k'=1 | k'=2 | k'=4 | k'=6
Top-k | 45.48 | 47.88 | 48.05 | 47.57
EMoE | 45.68 | 48.22 | 49.00 | 49.50
   w/o co-act


