Recognition: no theorem link
Fast MoE Inference via Predictive Prefetching and Expert Replication
Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3
The pith
Predicting overloaded experts and replicating them lets MoE models run tokens in parallel and reach near-full GPU use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a dynamic expert replication strategy, driven by predictions of which experts will be overloaded, allows replicated experts to handle batch tokens concurrently across layers. This produces near-complete GPU utilization, up to 3x faster inference, and retention of 90-95 percent of baseline performance on Switch-base-128 and Switch-base-256 models.
What carries the argument
Dynamic expert replication strategy that predicts overloaded experts and duplicates them for concurrent processing.
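The page gives no implementation details, so the mechanism can only be sketched. The EMA-based predictor, the multiple-of-mean overload rule, and every name below are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

def predict_overloaded(history, threshold=2.0):
    """Flag experts whose smoothed token count exceeds `threshold` x mean load.

    `history` is a (batches, experts) array of per-expert token counts from
    recent batches. The exponential moving average and the threshold rule are
    assumptions for illustration, not the paper's predictor.
    """
    ema = history[0].astype(float)
    for counts in history[1:]:          # smooth over recent batches
        ema = 0.7 * ema + 0.3 * counts
    mean_load = ema.mean()
    return np.where(ema > threshold * mean_load)[0]

history = np.array([
    [40, 3, 2, 5],    # expert 0 receives most tokens in every batch
    [38, 4, 3, 5],
    [41, 2, 4, 3],
])
print(predict_overloaded(history))  # → [0]: expert 0 flagged for replication
```

The flagged experts would then be duplicated before the next batch so their queued tokens can be split across copies.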
If this is right
- GPU utilization reaches approximately 100 percent.
- Inference speed increases by up to 3 times over the baseline.
- Model performance remains at 90-95 percent of the unreplicated version.
- Replicated experts process tokens concurrently to shorten idle periods.
- The method scales to large MoE models such as Switch-base-128 and Switch-base-256.
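A toy simulation makes the parallelism claim concrete: when one expert receives most of a batch's tokens, its queue bounds batch latency, and replicating it shortens the longest queue. The uniform one-unit token cost and the scheduler below are simplifying assumptions, not the paper's system:

```python
from collections import Counter

def makespan(token_routes, replicas):
    """Finish time of a batch, in expert-forward units.

    Each expert e runs replicas.get(e, 1) copies that split its queue
    evenly; the slowest copy determines when the batch completes.
    A toy cost model, not the paper's scheduler.
    """
    load = Counter(token_routes)
    # ceil-divide each expert's queue across its copies
    return max(-(-load[e] // replicas.get(e, 1)) for e in load)

routes = [0] * 30 + [1] * 5 + [2] * 5   # expert 0 is heavily overloaded
base = makespan(routes, {})             # no replication
fast = makespan(routes, {0: 6})         # six copies of expert 0
print(base, fast)                       # → 30 5: a 6x shorter critical path
```

Under this model the lightly loaded experts sit idle for 25 of the 30 baseline time units, which is exactly the idle time replication reclaims.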
Where Pith is reading between the lines
- The same prediction-plus-replication pattern might reduce latency in other sparse-activation networks that suffer load imbalance.
- Combining the replication decisions with hardware-aware scheduling could further cut memory overhead on specific GPUs.
- Over time, accurate expert-usage predictors may become a standard runtime component rather than a training-time concern.
- The approach leaves open whether replication decisions can be learned jointly with the base model to improve both accuracy and speed.
Load-bearing premise
The prediction model must be accurate enough on real workloads that the cost of creating and running replicated experts stays smaller than the gains from extra parallelism.
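The premise can be restated as a back-of-the-envelope break-even condition (not a formula from the paper): replication pays off only when the expected time saved by correct predictions exceeds the fixed overhead of creating and running replicas. All parameter values below are illustrative:

```python
def replication_pays_off(p_correct, speedup_if_correct, replica_cost, base_time=1.0):
    """Expected per-batch time with prediction-driven replication vs. without.

    A correct prediction divides batch time by `speedup_if_correct`; a wrong
    one leaves it unchanged; replication always costs `replica_cost`.
    Illustrative closed form, not one the paper derives.
    """
    expected = (p_correct * (base_time / speedup_if_correct)
                + (1 - p_correct) * base_time
                + replica_cost)
    return expected < base_time

# An accurate predictor (90%) with a 3x win absorbs a sizeable overhead...
print(replication_pays_off(0.9, 3.0, 0.2))   # → True  (0.3 + 0.1 + 0.2 = 0.6)
# ...while a mostly wrong predictor with the same overhead loses time.
print(replication_pays_off(0.2, 3.0, 0.2))   # → False (~0.07 + 0.8 + 0.2 ≈ 1.07)
```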
What would settle it
Measure inference time and GPU utilization on a new MoE model and workload where the overload predictor is replaced with random guesses; if speed falls below the unreplicated baseline, the claim does not hold.
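The proposed ablation can be sketched as a simulation (a hypothetical setup, not the paper's experiment): under a fixed replica budget, compare batch finish time when replication follows measured expert loads against replication by random guess.

```python
import random
from collections import Counter

def makespan(routes, replicated):
    """Batch finish time when each expert in `replicated` gets two copies.

    One time unit per token; a replicated expert's queue is split across
    its two copies. Toy cost model for the ablation, not a real runtime.
    """
    load = Counter(routes)
    return max(-(-c // (2 if e in replicated else 1)) for e, c in load.items())

random.seed(0)
# Skewed workload: expert 0 drawn far more often than the rest (illustrative).
routes = random.choices(range(8), weights=[50] + [5] * 7, k=400)

budget = 2  # replicate only two experts
hot = [e for e, _ in Counter(routes).most_common(budget)]  # load-aware choice
rand = random.sample(range(8), budget)                     # random guesses

print(makespan(routes, hot), makespan(routes, rand))
```

In this model a random chooser matches the load-aware one only when it happens to pick the hot expert; otherwise the hot expert's unsplit queue dominates and the speedup vanishes, which is the behavior the proposed test would look for.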
read the original abstract
The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise in LLMs and scaling model capacity without proportionally increasing their computational overhead. However, MoE inference often suffers from suboptimal GPU utilization, load imbalance, and elevated latency, as the sparsity of expert activation leaves multiple tokens waiting on the same experts for their computation. To address these challenges, we propose a dynamic expert replication strategy that predicts which experts are likely to be overloaded and replicates them for upcoming batches of tokens. The replicated experts process batch tokens concurrently across layers, which leads to improved parallelism, shorter GPU idle time, and significantly faster inference. Experimental evaluations conducted on large-scale MoE models, including Switch-base-128 and Switch-base-256, demonstrate that our method achieves near-complete GPU utilization (approx. 100%), leading to up to 3x improvement in inference speed while preserving approximately 90-95% of the performance of baseline architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dynamic expert replication strategy for Mixture of Experts (MoE) inference that predicts overloaded experts and replicates them to enable concurrent processing across layers. It reports experimental results on Switch-base-128 and Switch-base-256 models claiming near-100% GPU utilization, up to 3x inference speedup, and preservation of 90-95% of baseline performance.
Significance. If the empirical claims are substantiated with detailed measurements, the technique could provide a practical systems optimization for reducing latency and improving utilization in large MoE deployments without requiring architectural changes to the models.
major comments (2)
- [Abstract] The central claims of near-complete GPU utilization and up to 3x speedup are stated without any reported metrics on prediction accuracy, false-positive rates for replication decisions, or quantitative replication overhead (extra memory, communication, cache effects), which are required to confirm that overhead remains smaller than parallelism gains.
- [Abstract] No details are provided on baseline implementations, how replication decisions are made per layer or batch, the replication decision threshold (listed as a free parameter), or statistical significance of the reported speedups, leaving the robustness of the 3x claim and 90-95% performance preservation unclear.
Simulated Author's Rebuttal
We thank the referee for their valuable feedback. We provide point-by-point responses to the major comments and will make revisions to the abstract as appropriate.
read point-by-point responses
-
Referee: [Abstract] The central claims of near-complete GPU utilization and up to 3x speedup are stated without any reported metrics on prediction accuracy, false-positive rates for replication decisions, or quantitative replication overhead (extra memory, communication, cache effects), which are required to confirm that overhead remains smaller than parallelism gains.
Authors: We agree that the abstract does not include these specific supporting metrics. The manuscript body provides analysis of the prediction model, replication decisions, and associated overheads. We will revise the abstract to incorporate key quantitative results on prediction accuracy, false-positive rates, and replication overhead to better substantiate the central claims. revision: yes
-
Referee: [Abstract] No details are provided on baseline implementations, how replication decisions are made per layer or batch, the replication decision threshold (listed as a free parameter), or statistical significance of the reported speedups, leaving the robustness of the 3x claim and 90-95% performance preservation unclear.
Authors: The baseline implementation is the standard inference procedure for the Switch Transformer models, as described in the methods section. Replication decisions are made on a per-layer and per-batch basis using the predictive model with a tunable threshold parameter. Statistical significance is addressed through multiple experimental runs. We will update the abstract to include brief descriptions of the baseline, decision-making process, and robustness measures. revision: yes
Circularity Check
No significant circularity in empirical systems technique
full rationale
The paper describes an empirical dynamic expert replication strategy for MoE inference optimization, validated experimentally on external checkpoints such as Switch-base-128 and Switch-base-256. No equations, derivations, or first-principles results are presented that reduce the claimed GPU utilization or speedups to fitted parameters or self-referential definitions by construction. The approach relies on practical prediction of overloaded experts and replication decisions, with performance claims grounded in measured outcomes rather than tautological inputs. This matches the default expectation for non-circular empirical systems work.
Axiom & Free-Parameter Ledger
free parameters (1)
- replication decision threshold
axioms (1)
- domain assumption: Sparse expert activation in MoE leads to load imbalance and GPU idle time.