pith. machine review for the scientific record.

arxiv: 2605.02960 · v1 · submitted 2026-05-03 · 💻 cs.LG

Recognition: 2 theorem links


ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords: mixture of experts, prefill serving, asynchronous expert parallelism, LLM inference, distributed systems, model parallelism, throughput optimization, expert routing

The pith

ZeRO-Prefill replaces per-layer activation AllToAll with fully overlapped asynchronous expert weight AllGather to remove redundancy in MoE prefill serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing tensor, expert, and pipeline parallelism strategies for MoE models impose redundant computation, communication, and synchronization costs when used for prefill-only workloads such as classification and recommendation. These costs trace back to the design choice of tying expert placement to synchronous activation routing, which was carried over from autoregressive decoding. Because large-batch prefill passes are long and compute-bound, a per-layer window exists to stream expert weights in the background instead. ZeRO-Prefill implements this via its AsyncEP backend and a prefix-aware frontend that tracks true FLOPs to enforce saturation thresholds. The approach yields concrete throughput gains while raising per-GPU model utilization on large MoE models.
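
To make the claimed per-layer window concrete: the overlap premise reduces to comparing per-layer AllGather time against per-layer compute time. A minimal back-of-envelope sketch, with all numbers and parameter names as illustrative assumptions rather than figures from the paper:

```python
# Editorial back-of-envelope check of the overlap premise. All values below
# are illustrative assumptions, not measurements from the paper.

def overlap_window_ok(
    expert_bytes_per_layer: float,   # expert weights gathered per layer (bytes)
    interconnect_gbps: float,        # usable AllGather bandwidth per GPU (GB/s)
    layer_flops: float,              # batched FLOPs executed per layer in prefill
    gpu_tflops: float,               # sustained GPU throughput (TFLOP/s)
    efficiency: float = 0.5,         # assumed fraction of peak actually achieved
) -> bool:
    """True if per-layer compute time can hide the expert-weight AllGather."""
    comm_s = expert_bytes_per_layer / (interconnect_gbps * 1e9)
    compute_s = layer_flops / (gpu_tflops * 1e12 * efficiency)
    return compute_s >= comm_s

# Example with made-up numbers: 2 GB of expert weights per layer over a
# 50 GB/s effective link, against 200 TFLOPs of per-layer batch compute
# on a GPU sustaining roughly 500 TFLOP/s at 50% efficiency.
print(overlap_window_ok(2e9, 50, 200e12, 500))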

Core claim

The overheads in MoE prefill stem from coupling expert placement with synchronous activation routing; these can be eliminated by gathering experts via weight AllGather that is fully overlapped with computation in long, compute-bound prefill layers, implemented in AsyncEP together with prefix-aware routing and true-FLOPs load tracking.

What carries the argument

AsyncEP (Asynchronous Expert Parallelism), which gathers experts by weight AllGather overlapped with computation rather than routing activations synchronously.
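
A minimal sketch of that overlap pattern, assuming PyTorch-style collectives and a hypothetical per-layer call signature (this is an editorial illustration, not the paper's implementation): the next layer's full expert weights are prefetched with an asynchronous AllGather while the current layer computes.

```python
# Editorial sketch of weight-gather overlap, assuming torch.distributed is
# already initialized (e.g. via torchrun) and each rank holds one shard of
# every MoE layer's expert weights. The layer(hidden, experts) call signature
# is hypothetical, not the paper's API.
import torch
import torch.distributed as dist

def prefill_forward(hidden, layers, expert_shards):
    world_size = dist.get_world_size()
    device = hidden.device
    # Pre-allocate full-size buffers for the gathered experts of each layer.
    full = [torch.empty(world_size * s.numel(), dtype=s.dtype, device=device)
            for s in expert_shards]
    # Layer 0's experts are gathered up front; nothing to overlap with yet.
    dist.all_gather_into_tensor(full[0], expert_shards[0].contiguous().flatten())
    handle = None
    for i, layer in enumerate(layers):
        # Launch the AllGather for layer i+1 before running layer i, so the
        # transfer proceeds in the background while the GPU computes.
        if i + 1 < len(layers):
            handle = dist.all_gather_into_tensor(
                full[i + 1], expert_shards[i + 1].contiguous().flatten(),
                async_op=True)
        hidden = layer(hidden, full[i])   # long, compute-bound prefill work
        if handle is not None:
            handle.wait()                 # next layer's weights are now local
            handle = None
    return hidden
```

A real backend would presumably double-buffer two weight buffers across layers rather than hold every layer's gathered experts at once; the sketch keeps all buffers only for readability.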

If this is right

  • Delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world prefill workloads.
  • Reaches up to 1.59x throughput on long-context synthetic workloads.
  • Sustains 29.8-36.2% per-GPU model FLOPs utilization across four hardware and precision configurations.
  • Enables efficient prefill-only serving of discriminative tasks on models such as Qwen3-235B-A22B without redundant activation communication.
  • Removes the need for per-layer activation AllToAll in prefill by shifting to background weight streaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap principle could be tested in other long compute phases of inference if hardware bandwidth allows full hiding of weight movement.
  • Hardware designers might reduce emphasis on AllToAll-optimized interconnects for prefill-dominant MoE deployments if the weight-streaming approach scales.
  • Future MoE training runs could incorporate the saturation threshold logic to produce models that are easier to serve under prefill-only patterns.

Load-bearing premise

The compute-bound forward passes of large-batch prefill last long enough for the full expert weight AllGather to complete in the background without becoming a new latency bottleneck.

What would settle it

Measure the wall-clock time of a full expert-weight AllGather on the target GPUs and compare it directly to the per-layer compute time observed in a large-batch prefill run; if AllGather time exceeds the available overlap window, the claimed throughput gains should vanish.
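
A hedged version of that measurement, assuming an already-initialized torch.distributed process group on the target GPUs; the shard size and matmul shapes are placeholders standing in for the model's actual per-layer expert weights and batch geometry:

```python
# Editorial sketch of the settling measurement: time one full expert-weight
# AllGather against a stand-in for one layer's prefill compute. Run under an
# initialized process group (e.g. torchrun); sizes below are placeholders.
import torch
import torch.distributed as dist

def time_cuda_ms(fn, iters=10):
    fn()
    torch.cuda.synchronize()                            # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

world = dist.get_world_size()
shard = torch.empty(256 * 1024 * 1024, dtype=torch.float16, device="cuda")  # ~0.5 GB shard
gathered = torch.empty(world * shard.numel(), dtype=shard.dtype, device="cuda")
a = torch.randn(16384, 4096, dtype=torch.float16, device="cuda")             # batch-of-tokens stand-in
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

allgather_ms = time_cuda_ms(lambda: dist.all_gather_into_tensor(gathered, shard))
compute_ms = time_cuda_ms(lambda: a @ w)
print(f"AllGather {allgather_ms:.2f} ms vs stand-in per-layer compute {compute_ms:.2f} ms")
```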

Figures

Figures reproduced from arXiv: 2605.02960 by Aurick Qiao, Juncheng Yang, Karthik Ganesan, Olatunji Ruwase, Samyam Rajbhandari, Yue Cheng, Yuxiong He, Zhaoyuan Su.

Figure 1: Prefill-only workloads dominate LLM serving input… · view at source ↗
Figure 2: MoE model size vs. per-GPU HBM capacity across… · view at source ↗
Figure 4: Expert routing imbalance of Qwen3-30B-A3B on… · view at source ↗
Figure 5: ZeRO-Prefill system architecture and end-to-end prefill-only serving workflow. The frontend normalizes incoming tasks into prefill-only form and schedules them into saturation-bounded batches with prefix affinity; the backend executes each batch under data-parallel attention and asynchronous expert streaming, returning logits without entering any decoding loop. · view at source ↗
Figure 6: Conventional synchronous DP+EP with four GPUs. · view at source ↗
Figure 7: AsyncEP execution models with four GPUs. (a) Each GPU replicates all experts for the first layer and gathers subsequent… · view at source ↗
Figure 8: ZeRO-Prefill frontend scheduling with four GPUs, realized in three stages: (1) prefix-aware routing picks the GPU with the longest block-level cache match; (2) compute-aware tracking updates each GPU's true-FLOPs load after prefix-sharing credit; (3) overlap-aware balancing marks a GPU saturated once its load reaches the backend-derived threshold T. · view at source ↗
Figure 9: End-to-end throughput on the aggregated real-world prefill-only workload. · view at source ↗
Figure 10: Contribution of ZeRO-Prefill's two design tiers on the real-world workload. DP+AsyncEP applies the backend of §6 under vLLM's default scheduler; ZeRO-Prefill additionally applies the frontend of §7. Annotations report the throughput gain of adding the frontend over the backend-only configuration at each parallel degree. · view at source ↗
Figure 11: Throughput under synthetic workloads with no prefix reuse, across four context regimes on 8… · view at source ↗
Figure 12: MFU on Qwen3-235B-A22B (H100, FP8) under synthetic no-prefix-reuse workloads, across four context regimes… · view at source ↗
read the original abstract

Production LLM workloads increasingly serve discriminative tasks, such as classification, recommendation, and verification, whose answers are read from the logits of a single prefill pass with no autoregressive decoding. Serving these prefill-only workloads on mixture-of-experts (MoE) models is bottlenecked not by compute but by the distributed execution required to fit the model: existing parallel strategies (tensor, expert, and pipeline parallelism) trade memory pressure for redundant computation, communication, and synchronization, severely degrading MoE prefill serving efficiency. We observe that these overheads stem from coupling expert placement with synchronous activation routing -- a design inherited from the decoding era. The long, compute-bound forward passes of large-batch prefill open a per-layer window wide enough to stream expert weights in the background, replacing per-layer activation AllToAll with asynchronous weight AllGather fully overlapped with computation. We propose ZeRO-Prefill, a prefill-only serving system whose backend, AsyncEP (Asynchronous Expert Parallelism), gathers experts by weight rather than routing them by activation, and whose frontend co-enforces a physically-derived saturation threshold through prefix-aware routing and true-FLOPs load tracking. On Qwen3-235B-A22B across four hardware/precision configurations, ZeRO-Prefill delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world workloads and up to 1.59x on long-context synthetic workloads, sustaining 29.8-36.2% per-GPU model FLOPs utilization.
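
Read as pseudocode, the frontend described in the abstract and in the Figure 8 caption amounts to one greedy assignment loop per scheduling round. The sketch below is an editorial reconstruction; the request and GPU structures, the FLOPs proxy, and the cache-match callable are assumptions, not the paper's code.

```python
# Editorial sketch of one frontend scheduling round: route each prefill-only
# request to a GPU, credit shared prefixes, and stop filling a GPU once its
# true-FLOPs load reaches the backend-derived saturation threshold T.
# All structures below are hypothetical stand-ins.

def flops_estimate(num_new_tokens, hidden=4096, layers=48):
    # Crude per-request FLOPs proxy; a real tracker would use the model's
    # actual attention + expert FLOPs for the non-cached suffix.
    return 2 * num_new_tokens * hidden * hidden * 12 * layers

def schedule_round(requests, gpus, threshold_T):
    for req in requests:
        candidates = [g for g in gpus if not g["saturated"]]
        if not candidates:
            break  # all GPUs saturated; remaining requests wait for the next round
        # (1) Prefix-aware routing: longest block-level prefix-cache match wins.
        best = max(candidates, key=lambda g: g["cache_match_len"](req["tokens"]))
        cached = best["cache_match_len"](req["tokens"])
        # (2) Compute-aware tracking: charge only the non-shared suffix.
        best["load_flops"] += flops_estimate(len(req["tokens"]) - cached)
        best["batch"].append(req)
        # (3) Overlap-aware balancing: saturate at the derived threshold T.
        if best["load_flops"] >= threshold_T:
            best["saturated"] = True
    return gpus
```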

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ZeRO-Prefill, a prefill-only serving system for large MoE models such as Qwen3-235B-A22B. Its core contribution is AsyncEP (Asynchronous Expert Parallelism), which replaces per-layer activation AllToAll with asynchronous weight AllGather that is overlapped with the long compute-bound forward passes of large-batch prefill. The frontend adds prefix-aware routing and true-FLOPs load tracking to enforce a saturation threshold. On four hardware/precision configurations the system is reported to deliver 1.35-1.37x throughput over the strongest baseline on real-world workloads and up to 1.59x on long-context synthetic workloads while sustaining 29.8-36.2% per-GPU model FLOPs utilization.

Significance. If the reported speedups are substantiated, ZeRO-Prefill would demonstrate a practical way to remove communication redundancy in MoE prefill serving by exploiting the compute-bound character of prefill phases. This would be relevant for the growing class of discriminative, prefill-only LLM workloads and could improve hardware utilization without introducing redundant computation or memory pressure.

major comments (2)
  1. [Abstract and Evaluation] The central throughput claims (1.35-1.59x) rest on the assumption that per-layer prefill compute time is long enough to fully hide asynchronous weight AllGather latency. The manuscript supplies no per-layer timing breakdowns, AllGather volume measurements, or overlap-efficiency numbers for Qwen3-235B-A22B across the four configurations. Without these data it is impossible to confirm that residual communication latency does not offset the savings from eliminating activation AllToAll.
  2. [Abstract] The reported speedups and MFU figures are given without error bars, without a precise description of the baseline implementations (tensor, expert, and pipeline parallelism), and without workload details beyond the labels 'real-world' and 'synthetic'. These omissions make the quantitative claims difficult to reproduce or compare.
minor comments (1)
  1. The term 'physically-derived saturation threshold' is introduced without an explicit derivation or reference to the underlying physical model; a short appendix or equation would clarify how the threshold is obtained from hardware parameters.
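
For context on that comment, one plausible way such a threshold could fall out of hardware parameters, derived only from the overlap condition stated in the abstract (editorial reconstruction; the symbols are assumptions, not the paper's notation):

```latex
% Overlap holds when per-layer compute time covers the per-layer weight AllGather:
%   F / (P_gpu * eta)  >=  W_expert / B_net
% so the minimum batched FLOPs per layer, a natural saturation threshold, is
\[
  T \;=\; \frac{P_{\mathrm{gpu}} \, \eta \, W_{\mathrm{expert}}}{B_{\mathrm{net}}},
\]
% where P_gpu is sustained GPU throughput (FLOP/s), eta an efficiency factor,
% W_expert the expert bytes gathered per layer, and B_net the effective
% AllGather bandwidth (bytes/s).
```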

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and substantiation of the results.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The central throughput claims (1.35-1.59x) rest on the assumption that per-layer prefill compute time is long enough to fully hide asynchronous weight AllGather latency. The manuscript supplies no per-layer timing breakdowns, AllGather volume measurements, or overlap-efficiency numbers for Qwen3-235B-A22B across the four configurations. Without these data it is impossible to confirm that residual communication latency does not offset the savings from eliminating activation AllToAll.

    Authors: We agree that explicit measurements are required to validate the overlap assumption. In the revised manuscript we will add per-layer timing breakdowns, AllGather communication volume data, and overlap-efficiency percentages for Qwen3-235B-A22B on all four hardware/precision configurations. These additions will quantify the compute window available for hiding the asynchronous weight AllGather and confirm that residual communication latency remains negligible relative to the savings from removing activation AllToAll. revision: yes

  2. Referee: [Abstract] The reported speedups and MFU figures are given without error bars, without a precise description of the baseline implementations (tensor, expert, and pipeline parallelism), and without workload details beyond the labels 'real-world' and 'synthetic'. These omissions make the quantitative claims difficult to reproduce or compare.

    Authors: The referee is correct that reproducibility requires these details. We will expand the abstract and evaluation sections to include error bars on all throughput and MFU numbers, precise specifications of the baseline tensor, expert, and pipeline parallelism configurations (including degree and implementation), and additional workload characteristics such as sequence-length distributions and batch-size ranges for both the real-world and synthetic traces. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with no self-referential derivations

full rationale

The paper's core contribution is an empirical system (ZeRO-Prefill with AsyncEP) that replaces per-layer activation AllToAll with overlapped asynchronous weight AllGather, justified by the stated observation that large-batch prefill forward passes provide sufficient compute time to hide communication. Throughput gains (1.35-1.59x) and MFU figures (29.8-36.2%) are presented as direct hardware measurements on Qwen3-235B-A22B across configurations, not as outputs of any equations, fitted parameters, or predictions that reduce to the inputs by construction. No mathematical derivations, uniqueness theorems, ansatzes, or self-citations are invoked as load-bearing steps in a derivation chain; the feasibility of overlap is treated as an empirical question verified by end-to-end benchmarks rather than assumed internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that prefill passes are sufficiently long and compute-bound to hide weight movement latency, plus standard distributed-systems assumptions about network and memory bandwidth; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (1)
  • domain assumption Large-batch prefill forward passes are long enough to fully overlap asynchronous weight AllGather with computation.
    Stated in the abstract as the key observation enabling the design.
invented entities (1)
  • AsyncEP (Asynchronous Expert Parallelism) · no independent evidence
    purpose: Backend that gathers experts by weight rather than routing by activation.
    New serving backend introduced to replace synchronous activation AllToAll.

pith-pipeline@v0.9.0 · 5614 in / 1531 out tokens · 31007 ms · 2026-05-08T19:25:03.739227+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt- oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachan- dran Ramjee. Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

  3. [3]

    Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Am- mar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, ...

  4. [4]

    A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

  5. [5]

    Moe-lightning: High-throughput moe inference on memory-constrained gpus

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xi- aoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Za- haria, and Ion Stoica. Moe-lightning: High-throughput moe inference on memory-constrained gpus. InPro- ceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–...

  6. [6]

    LexGLUE: A benchmark dataset for legal language understanding in English

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bom- marito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal language understanding in English. InProceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, 2022

  7. [7]

    Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

  8. [8]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for com- putational linguistics: Human language technologies, volume 1 (long and short papers)...

  9. [9]

    Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

  10. [10]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  11. [11]

    GoEmotions: A dataset of fine-grained emotions

    Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, 2020

  12. [12]

    Prefillonly: An infer- ence engine for prefill-only workloads in large language model applications

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xi- aoxuan Liu, Yifan Qiao, et al. Prefillonly: An infer- ence engine for prefill-only workloads in large language model applications. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 399–414, 2025

  13. [13]

    Glam: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning, pages 5547–5569. PMLR, 2022

  14. [14]

    Moral stories: Situated reason- ing about norms, intents, actions, and their consequences

    Denis Emelin, Ronan Le Bras, Jena D Hwang, Maxwell Forbes, and Yejin Choi. Moral stories: Situated reason- ing about norms, intents, actions, and their consequences. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698– 718, 2021

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learn- ing Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learn- ing Research, 23(120):1–39, 2022

  16. [16]

    Cascade Inference

    FlashInfer. Cascade Inference. https://flashinfer. ai/2024/02/02/cascade-inference.html, 2024. Blog post. Accessed: 2026-04-23

  17. [17]

    Megablocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5:288–304, 2023

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5:288–304, 2023

  18. [18]

    {Cost-Efficient} large lan- guage model serving for multi-turn conversations with {CachedAttention}

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, 2024

  19. [19]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Ariel Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  20. [20]

    Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

    Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, et al. Elastic moe: Unlocking the inference-time scalability of mixture-of-experts.arXiv preprint arXiv:2509.21892, 2025

  21. [21]

    Sti: Turbocharge nlp inference at the edge via elastic pipelin- ing

    Liwei Guo, Wonkyo Choe, and Felix Xiaozhu Lin. Sti: Turbocharge nlp inference at the edge via elastic pipelin- ing. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 791– 803, 2023

  22. [22]

    FastMoE: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262,

    Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. Fastmoe: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262, 2021

  23. [23]

    Fastermoe: modeling and optimizing training of large-scale dy- namic pre-trained models

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dy- namic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022

  24. [24]

    Long document classification from local word glimpses via recurrent attention learning.IEEE Access, 7:40707– 40718, 2019

    Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu. Long document classification from local word glimpses via recurrent attention learning.IEEE Access, 7:40707– 40718, 2019

  25. [25]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  26. [26]

    Gpipe: Effi- cient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Effi- cient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

  27. [27]

    Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5:269–287, 2023

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5:269–287, 2023

  28. [28]

    Pre- gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference

    Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. Pre- gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1018–1031. IEEE, 2024

  29. [29]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  30. [30]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  31. [31]

    Lancet: Accelerating mixture- of-experts training via whole graph computation- communication overlapping.Proceedings of Machine Learning and Systems, 6:74–86, 2024

    Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture- of-experts training via whole graph computation- communication overlapping.Proceedings of Machine Learning and Systems, 6:74–86, 2024

  32. [32]

    Hydragen: High-throughput llm inference with shared prefixes

    Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydra- gen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024

  33. [33]

    Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

  34. [34]

    Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6710–6720, 2024

  35. [35]

    Reducing activation re- computation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation re- computation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

  36. [36]

    Efficient memory manage- ment for large language model serving with pagedatten- tion

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  37. [37]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, De- hao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling gi- ant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

  38. [38]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  39. [39]

    Accelerating distributed {MoE} training and inference with lina

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed {MoE} training and inference with lina. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, 2023

  40. [40]

    PyTorch Distributed: Experiences on Accelerating Data Parallel Training

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020

  41. [41]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

  42. [42]

    Cachegen: Kv cache compression and streaming for fast large lan- guage model serving

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large lan- guage model serving. InProceedings of the ACM SIG- COMM 2024 Conference, pages 38–56, 2024

  43. [43]

    Learning word vectors for sentiment analysis

    Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computa- tional linguistics: Human language technologies, pages 142–150, 2011

  44. [44]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

  45. [45]

    Pipedream: Gen- eralized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Gen- eralized pipeline parallelism for dnn training. InPro- ceedings of the 27th ACM symposium on operating sys- tems principles, pages 1–15, 2019

  46. [46]

    Twitter Financial News Sentiment

    Neural Magic. Twitter Financial News Sentiment. https://huggingface.co/datasets/zeroshot/ twitter-financial-news-sentiment , 2022. Hug- ging Face dataset. Accessed: 2026-04-23

  47. [47]

    NVIDIA H100 Tensor Core GPU Archi- tecture Whitepaper

    NVIDIA. NVIDIA H100 Tensor Core GPU Archi- tecture Whitepaper. https://www.nvidia.com/en- us/data-center/h100/, 2026. Accessed: 2026-04- 23

  48. [48]

    TensorRT-LLM

    NVIDIA. TensorRT-LLM. https://github.com/ NVIDIA/TensorRT-LLM, 2026. GitHub repository. Ac- cessed: 2026-04-23

  49. [49]

    QuALITY: Question answering with long input texts, yes!

    Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...

  50. [50]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  51. [51]

    Eps- moe: Expert pipeline scheduler for cost-efficient moe inference.arXiv preprint arXiv:2410.12247, 2024

    Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, and Xunliang Cai. Eps- moe: Expert pipeline scheduler for cost-efficient moe inference.arXiv preprint arXiv:2410.12247, 2024

  52. [52]

    Is ChatGPT a general-purpose natural language processing task solver?

    Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is ChatGPT a general-purpose natural language processing task solver? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1339–1384, 2023

  53. [53]

    Mooncake: A kvcache- centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache- centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

  54. [54]

    Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Min- jia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. InInternational confer- ence on machine learning, pages 18332–18346. PMLR, 2022

  55. [55]

    Zero: Memory optimizations toward train- ing trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  56. [56]

    Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high per- formance computing, networking, storage and analysis, pages 1–14, 2021

  57. [57]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  58. [58]

    Flexgen: High-throughput generative inference of large language models with a single gpu

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. InInternational Conference on Machine Learning, pages 31094–31116. PMLR, 2023

  59. [59]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  60. [60]

    ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models

    Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang, and Zhenan Fan. ElasticMoE: An efficient auto scaling method for mixture-of-experts models. arXiv preprint arXiv:2510.02613, 2025

  61. [61]

    Text classifi- cation via large language models

    Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classifi- cation via large language models. InFindings of the As- sociation for Computational Linguistics: EMNLP 2023, pages 8990–9005, 2023

  62. [62]

    The Toxicity Dataset

    Surge AI. The Toxicity Dataset. https://github. com/surge-ai/toxicity, 2022. GitHub repository. Accessed: 2026-04-23

  63. [63]

    Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures

    Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, and John Paul Shen. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. In2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 49–61. IEEE, 2025

  64. [64]

    Moe-infinity: Offloading-efficient moe model serving.arXiv e-prints, pages arXiv–2401, 2024

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe-infinity: Offloading-efficient moe model serving.arXiv e-prints, pages arXiv–2401, 2024

  65. [65]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  66. [66]

    Exploiting inter- layer expert affinity for accelerating mixture-of-experts model inference

    Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Sub- ramoni, and Dhabaleswar K DK Panda. Exploiting inter- layer expert affinity for accelerating mixture-of-experts model inference. In2024 IEEE International parallel and distributed processing symposium (IPDPS), pages 915–925. IEEE, 2024

  67. [67]

    Chunkatten- tion: Efficient self-attention with prefix-aware kv cache and two-phase partition

    Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkatten- tion: Efficient self-attention with prefix-aware kv cache and two-phase partition. InProceedings of the 62nd An- nual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 11608–11620, 2024

  68. [68]

    Orca: A distributed serving system for {Transformer-Based} generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soo- jeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating sys- tems design and implementation (OSDI 22), pages 521– 538, 2022

  69. [69]

    Recommendation as instruction following: A large language model empow- ered recommendation approach.ACM Transactions on Information Systems, 43(5):1–37, 2026

    Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empow- ered recommendation approach.ACM Transactions on Information Systems, 43(5):1–37, 2026

  70. [70]

    BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

    Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. BlendServe: Optimizing offline inference for auto-regressive large models with resource-aware batching. arXiv preprint arXiv:2411.16102, 2024

  71. [71]

    Sglang: Efficient execution of structured language model pro- grams.Advances in neural information processing sys- tems, 37:62557–62583, 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model pro- grams.Advances in neural information processing sys- tems, 37:62557–62583, 2024

  72. [72]

    BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

    Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching.arXiv preprint arXiv:2412.03594, 2024

  73. [73]

    {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

  74. [74]

    Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

  75. [75]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022