pith. machine review for the scientific record.

arxiv: 2605.02960 · v1 · submitted 2026-05-03 · 💻 cs.LG

Recognition: 2 theorem links


ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords: mixture of experts, prefill serving, asynchronous expert parallelism, LLM inference, distributed systems, model parallelism, throughput optimization, expert routing

The pith

ZeRO-Prefill replaces per-layer activation AllToAll with fully overlapped asynchronous expert weight AllGather to remove redundancy in MoE prefill serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing tensor, expert, and pipeline parallelism strategies for MoE models impose redundant computation, communication, and synchronization costs when used for prefill-only workloads such as classification and recommendation. These costs trace back to the design choice of tying expert placement to synchronous activation routing, which was carried over from autoregressive decoding. Because large-batch prefill passes are long and compute-bound, a per-layer window exists to stream expert weights in the background instead. ZeRO-Prefill implements this via its AsyncEP backend and a prefix-aware frontend that tracks true FLOPs to enforce saturation thresholds. The approach yields concrete throughput gains while raising per-GPU model utilization on large MoE models.
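
To make the claimed per-layer window concrete: the overlap premise reduces to comparing per-layer AllGather time against per-layer compute time. A minimal back-of-envelope sketch, with all numbers and parameter names as illustrative assumptions rather than figures from the paper:

```python
# Editorial back-of-envelope check of the overlap premise. All values below
# are illustrative assumptions, not measurements from the paper.

def overlap_window_ok(
    expert_bytes_per_layer: float,   # expert weights gathered per layer (bytes)
    interconnect_gbps: float,        # usable AllGather bandwidth per GPU (GB/s)
    layer_flops: float,              # batched FLOPs executed per layer in prefill
    gpu_tflops: float,               # sustained GPU throughput (TFLOP/s)
    efficiency: float = 0.5,         # assumed fraction of peak actually achieved
) -> bool:
    """True if per-layer compute time can hide the expert-weight AllGather."""
    comm_s = expert_bytes_per_layer / (interconnect_gbps * 1e9)
    compute_s = layer_flops / (gpu_tflops * 1e12 * efficiency)
    return compute_s >= comm_s

# Example with made-up numbers: 2 GB of expert weights per layer over a
# 50 GB/s effective link, against 200 TFLOPs of per-layer batch compute
# on a GPU sustaining roughly 500 TFLOP/s at 50% efficiency.
print(overlap_window_ok(2e9, 50, 200e12, 500))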

Core claim

The overheads in MoE prefill stem from coupling expert placement with synchronous activation routing; these can be eliminated by gathering experts via weight AllGather that is fully overlapped with computation in long, compute-bound prefill layers, implemented in AsyncEP together with prefix-aware routing and true-FLOPs load tracking.

What carries the argument

AsyncEP (Asynchronous Expert Parallelism), which gathers experts by weight AllGather overlapped with computation rather than routing activations synchronously.
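
A minimal sketch of that overlap pattern, assuming PyTorch-style collectives and a hypothetical per-layer call signature (this is an editorial illustration, not the paper's implementation): the next layer's full expert weights are prefetched with an asynchronous AllGather while the current layer computes.

```python
# Editorial sketch of weight-gather overlap, assuming torch.distributed is
# already initialized (e.g. via torchrun) and each rank holds one shard of
# every MoE layer's expert weights. The layer(hidden, experts) call signature
# is hypothetical, not the paper's API.
import torch
import torch.distributed as dist

def prefill_forward(hidden, layers, expert_shards):
    world_size = dist.get_world_size()
    device = hidden.device
    # Pre-allocate full-size buffers for the gathered experts of each layer.
    full = [torch.empty(world_size * s.numel(), dtype=s.dtype, device=device)
            for s in expert_shards]
    # Layer 0's experts are gathered up front; nothing to overlap with yet.
    dist.all_gather_into_tensor(full[0], expert_shards[0].contiguous().flatten())
    handle = None
    for i, layer in enumerate(layers):
        # Launch the AllGather for layer i+1 before running layer i, so the
        # transfer proceeds in the background while the GPU computes.
        if i + 1 < len(layers):
            handle = dist.all_gather_into_tensor(
                full[i + 1], expert_shards[i + 1].contiguous().flatten(),
                async_op=True)
        hidden = layer(hidden, full[i])   # long, compute-bound prefill work
        if handle is not None:
            handle.wait()                 # next layer's weights are now local
            handle = None
    return hidden
```

A real backend would presumably double-buffer two weight buffers across layers rather than hold every layer's gathered experts at once; the sketch keeps all buffers only for readability.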

If this is right

  • Delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world prefill workloads.
  • Reaches up to 1.59x throughput on long-context synthetic workloads.
  • Sustains 29.8-36.2% per-GPU model FLOPs utilization across four hardware and precision configurations.
  • Enables efficient prefill-only serving of discriminative tasks on models such as Qwen3-235B-A22B without redundant activation communication.
  • Removes the need for per-layer activation AllToAll in prefill by shifting to background weight streaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same overlap principle could be tested in other long compute phases of inference if hardware bandwidth allows full hiding of weight movement.
  • Hardware designers might reduce emphasis on AllToAll-optimized interconnects for prefill-dominant MoE deployments if the weight-streaming approach scales.
  • Future MoE training runs could incorporate the saturation threshold logic to produce models that are easier to serve under prefill-only patterns.

Load-bearing premise

The compute-bound forward passes of large-batch prefill last long enough for the full expert weight AllGather to complete in the background without becoming a new latency bottleneck.

What would settle it

Measure the wall-clock time of a full expert-weight AllGather on the target GPUs and compare it directly to the per-layer compute time observed in a large-batch prefill run; if AllGather time exceeds the available overlap window, the claimed throughput gains should vanish.
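
A hedged version of that measurement, assuming an already-initialized torch.distributed process group on the target GPUs; the shard size and matmul shapes are placeholders standing in for the model's actual per-layer expert weights and batch geometry:

```python
# Editorial sketch of the settling measurement: time one full expert-weight
# AllGather against a stand-in for one layer's prefill compute. Run under an
# initialized process group (e.g. torchrun); sizes below are placeholders.
import torch
import torch.distributed as dist

def time_cuda_ms(fn, iters=10):
    fn()
    torch.cuda.synchronize()                            # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

world = dist.get_world_size()
shard = torch.empty(256 * 1024 * 1024, dtype=torch.float16, device="cuda")  # ~0.5 GB shard
gathered = torch.empty(world * shard.numel(), dtype=shard.dtype, device="cuda")
a = torch.randn(16384, 4096, dtype=torch.float16, device="cuda")             # batch-of-tokens stand-in
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

allgather_ms = time_cuda_ms(lambda: dist.all_gather_into_tensor(gathered, shard))
compute_ms = time_cuda_ms(lambda: a @ w)
print(f"AllGather {allgather_ms:.2f} ms vs stand-in per-layer compute {compute_ms:.2f} ms")
```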

Figures

Figures reproduced from arXiv: 2605.02960 by Aurick Qiao, Juncheng Yang, Karthik Ganesan, Olatunji Ruwase, Samyam Rajbhandari, Yue Cheng, Yuxiong He, Zhaoyuan Su.

Figure 1: Prefill-only workloads dominate LLM serving input… · view at source ↗
Figure 2: MoE model size vs. per-GPU HBM capacity across… · view at source ↗
Figure 4: Expert routing imbalance of Qwen3-30B-A3B on… · view at source ↗
Figure 5: ZeRO-Prefill system architecture and end-to-end prefill-only serving workflow. The frontend normalizes incoming tasks into prefill-only form and schedules them into saturation-bounded batches with prefix affinity; the backend executes each batch under data-parallel attention and asynchronous expert streaming, returning logits without entering any decoding loop. · view at source ↗
Figure 6: Conventional synchronous DP+EP with four GPUs. · view at source ↗
Figure 7: AsyncEP execution models with four GPUs. (a) Each GPU replicates all experts for the first layer and gathers subsequent… · view at source ↗
Figure 8: ZeRO-Prefill frontend scheduling with four GPUs, realized in three stages: (1) prefix-aware routing picks the GPU with the longest block-level cache match; (2) compute-aware tracking updates each GPU's true-FLOPs load after prefix-sharing credit; (3) overlap-aware balancing marks a GPU saturated once its load reaches the backend-derived threshold T. · view at source ↗
Figure 9: End-to-end throughput on the aggregated real-world prefill-only workload. · view at source ↗
Figure 10: Contribution of ZeRO-Prefill's two design tiers on the real-world workload. DP+AsyncEP applies the backend of §6 under vLLM's default scheduler; ZeRO-Prefill additionally applies the frontend of §7. Annotations report the throughput gain of adding the frontend over the backend-only configuration at each parallel degree. · view at source ↗
Figure 11: Throughput under synthetic workloads with no prefix reuse, across four context regimes on 8… · view at source ↗
Figure 12: MFU on Qwen3-235B-A22B (H100, FP8) under synthetic no-prefix-reuse workloads, across four context regimes… · view at source ↗
read the original abstract

Production LLM workloads increasingly serve discriminative tasks, such as classification, recommendation, and verification, whose answers are read from the logits of a single prefill pass with no autoregressive decoding. Serving these prefill-only workloads on mixture-of-experts (MoE) models is bottlenecked not by compute but by the distributed execution required to fit the model: existing parallel strategies (tensor, expert, and pipeline parallelism) trade memory pressure for redundant computation, communication, and synchronization, severely degrading MoE prefill serving efficiency. We observe that these overheads stem from coupling expert placement with synchronous activation routing -- a design inherited from the decoding era. The long, compute-bound forward passes of large-batch prefill open a per-layer window wide enough to stream expert weights in the background, replacing per-layer activation AllToAll with asynchronous weight AllGather fully overlapped with computation. We propose ZeRO-Prefill, a prefill-only serving system whose backend, AsyncEP (Asynchronous Expert Parallelism), gathers experts by weight rather than routing them by activation, and whose frontend co-enforces a physically-derived saturation threshold through prefix-aware routing and true-FLOPs load tracking. On Qwen3-235B-A22B across four hardware/precision configurations, ZeRO-Prefill delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world workloads and up to 1.59x on long-context synthetic workloads, sustaining 29.8-36.2% per-GPU model FLOPs utilization.
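
Read as pseudocode, the frontend described in the abstract and in the Figure 8 caption amounts to one greedy assignment loop per scheduling round. The sketch below is an editorial reconstruction; the request and GPU structures, the FLOPs proxy, and the cache-match callable are assumptions, not the paper's code.

```python
# Editorial sketch of one frontend scheduling round: route each prefill-only
# request to a GPU, credit shared prefixes, and stop filling a GPU once its
# true-FLOPs load reaches the backend-derived saturation threshold T.
# All structures below are hypothetical stand-ins.

def flops_estimate(num_new_tokens, hidden=4096, layers=48):
    # Crude per-request FLOPs proxy; a real tracker would use the model's
    # actual attention + expert FLOPs for the non-cached suffix.
    return 2 * num_new_tokens * hidden * hidden * 12 * layers

def schedule_round(requests, gpus, threshold_T):
    for req in requests:
        candidates = [g for g in gpus if not g["saturated"]]
        if not candidates:
            break  # all GPUs saturated; remaining requests wait for the next round
        # (1) Prefix-aware routing: longest block-level prefix-cache match wins.
        best = max(candidates, key=lambda g: g["cache_match_len"](req["tokens"]))
        cached = best["cache_match_len"](req["tokens"])
        # (2) Compute-aware tracking: charge only the non-shared suffix.
        best["load_flops"] += flops_estimate(len(req["tokens"]) - cached)
        best["batch"].append(req)
        # (3) Overlap-aware balancing: saturate at the derived threshold T.
        if best["load_flops"] >= threshold_T:
            best["saturated"] = True
    return gpus
```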

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ZeRO-Prefill, a prefill-only serving system for large MoE models such as Qwen3-235B-A22B. Its core contribution is AsyncEP (Asynchronous Expert Parallelism), which replaces per-layer activation AllToAll with asynchronous weight AllGather that is overlapped with the long compute-bound forward passes of large-batch prefill. The frontend adds prefix-aware routing and true-FLOPs load tracking to enforce a saturation threshold. On four hardware/precision configurations the system is reported to deliver 1.35-1.37x throughput over the strongest baseline on real-world workloads and up to 1.59x on long-context synthetic workloads while sustaining 29.8-36.2% per-GPU model FLOPs utilization.

Significance. If the reported speedups are substantiated, ZeRO-Prefill would demonstrate a practical way to remove communication redundancy in MoE prefill serving by exploiting the compute-bound character of prefill phases. This would be relevant for the growing class of discriminative, prefill-only LLM workloads and could improve hardware utilization without introducing redundant computation or memory pressure.

major comments (2)
  1. [Abstract and Evaluation] The central throughput claims (1.35-1.59x) rest on the assumption that per-layer prefill compute time is long enough to fully hide asynchronous weight AllGather latency. The manuscript supplies no per-layer timing breakdowns, AllGather volume measurements, or overlap-efficiency numbers for Qwen3-235B-A22B across the four configurations. Without these data it is impossible to confirm that residual communication latency does not offset the savings from eliminating activation AllToAll.
  2. [Abstract] The reported speedups and MFU figures are given without error bars, without a precise description of the baseline implementations (tensor, expert, and pipeline parallelism), and without workload details beyond the labels 'real-world' and 'synthetic'. These omissions make the quantitative claims difficult to reproduce or compare.
minor comments (1)
  1. The term 'physically-derived saturation threshold' is introduced without an explicit derivation or reference to the underlying physical model; a short appendix or equation would clarify how the threshold is obtained from hardware parameters.
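
For context on that comment, one plausible way such a threshold could fall out of hardware parameters, derived only from the overlap condition stated in the abstract (editorial reconstruction; the symbols are assumptions, not the paper's notation):

```latex
% Overlap holds when per-layer compute time covers the per-layer weight AllGather:
%   F / (P_gpu * eta)  >=  W_expert / B_net
% so the minimum batched FLOPs per layer, a natural saturation threshold, is
\[
  T \;=\; \frac{P_{\mathrm{gpu}} \, \eta \, W_{\mathrm{expert}}}{B_{\mathrm{net}}},
\]
% where P_gpu is sustained GPU throughput (FLOP/s), eta an efficiency factor,
% W_expert the expert bytes gathered per layer, and B_net the effective
% AllGather bandwidth (bytes/s).
```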

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and substantiation of the results.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The central throughput claims (1.35-1.59x) rest on the assumption that per-layer prefill compute time is long enough to fully hide asynchronous weight AllGather latency. The manuscript supplies no per-layer timing breakdowns, AllGather volume measurements, or overlap-efficiency numbers for Qwen3-235B-A22B across the four configurations. Without these data it is impossible to confirm that residual communication latency does not offset the savings from eliminating activation AllToAll.

    Authors: We agree that explicit measurements are required to validate the overlap assumption. In the revised manuscript we will add per-layer timing breakdowns, AllGather communication volume data, and overlap-efficiency percentages for Qwen3-235B-A22B on all four hardware/precision configurations. These additions will quantify the compute window available for hiding the asynchronous weight AllGather and confirm that residual communication latency remains negligible relative to the savings from removing activation AllToAll. revision: yes

  2. Referee: [Abstract] The reported speedups and MFU figures are given without error bars, without a precise description of the baseline implementations (tensor, expert, and pipeline parallelism), and without workload details beyond the labels 'real-world' and 'synthetic'. These omissions make the quantitative claims difficult to reproduce or compare.

    Authors: The referee is correct that reproducibility requires these details. We will expand the abstract and evaluation sections to include error bars on all throughput and MFU numbers, precise specifications of the baseline tensor, expert, and pipeline parallelism configurations (including degree and implementation), and additional workload characteristics such as sequence-length distributions and batch-size ranges for both the real-world and synthetic traces. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with no self-referential derivations

full rationale

The paper's core contribution is an empirical system (ZeRO-Prefill with AsyncEP) that replaces per-layer activation AllToAll with overlapped asynchronous weight AllGather, justified by the stated observation that large-batch prefill forward passes provide sufficient compute time to hide communication. Throughput gains (1.35-1.59x) and MFU figures (29.8-36.2%) are presented as direct hardware measurements on Qwen3-235B-A22B across configurations, not as outputs of any equations, fitted parameters, or predictions that reduce to the inputs by construction. No mathematical derivations, uniqueness theorems, ansatzes, or self-citations are invoked as load-bearing steps in a derivation chain; the feasibility of overlap is treated as an empirical question verified by end-to-end benchmarks rather than assumed internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that prefill passes are sufficiently long and compute-bound to hide weight movement latency, plus standard distributed-systems assumptions about network and memory bandwidth; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (1)
  • domain assumption Large-batch prefill forward passes are long enough to fully overlap asynchronous weight AllGather with computation.
    Stated in the abstract as the key observation enabling the design.
invented entities (1)
  • AsyncEP (Asynchronous Expert Parallelism) · no independent evidence
    purpose: Backend that gathers experts by weight rather than routing by activation.
    New serving backend introduced to replace synchronous activation AllToAll.

pith-pipeline@v0.9.0 · 5614 in / 1531 out tokens · 31007 ms · 2026-05-08T19:25:03.739227+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt- oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachan- dran Ramjee. Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

  3. [3]

    Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Am- mar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, ...

  4. [4]

    A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

    Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

  5. [5]

    Moe-lightning: High-throughput moe inference on memory-constrained gpus

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xi- aoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Za- haria, and Ion Stoica. Moe-lightning: High-throughput moe inference on memory-constrained gpus. InPro- ceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–...

  6. [6]

    LexGLUE: A benchmark dataset for legal language understanding in English

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bom- marito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal language understanding in English. InProceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, 2022

  7. [7]

    Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

  8. [8]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for com- putational linguistics: Human language technologies, volume 1 (long and short papers)...

  9. [9]

    Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

  10. [10]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  11. [11]

    GoEmotions: A dataset of fine-grained emotions

    Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, 2020

  12. [12]

    Prefillonly: An infer- ence engine for prefill-only workloads in large language model applications

    Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xi- aoxuan Liu, Yifan Qiao, et al. Prefillonly: An infer- ence engine for prefill-only workloads in large language model applications. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 399–414, 2025

  13. [13]

    Glam: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning, pages 5547–5569. PMLR, 2022

  14. [14]

    Moral stories: Situated reason- ing about norms, intents, actions, and their consequences

    Denis Emelin, Ronan Le Bras, Jena D Hwang, Maxwell Forbes, and Yejin Choi. Moral stories: Situated reason- ing about norms, intents, actions, and their consequences. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698– 718, 2021

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learn- ing Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learn- ing Research, 23(120):1–39, 2022

  16. [16]

    Cascade Inference

    FlashInfer. Cascade Inference. https://flashinfer. ai/2024/02/02/cascade-inference.html, 2024. Blog post. Accessed: 2026-04-23

  17. [17]

    Megablocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5:288–304, 2023

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5:288–304, 2023

  18. [18]

    {Cost-Efficient} large lan- guage model serving for multi-turn conversations with {CachedAttention}

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 111–126, 2024

  19. [19]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Ariel Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  20. [20]

    Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

    Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, et al. Elastic moe: Unlocking the inference-time scalability of mixture-of-experts.arXiv preprint arXiv:2509.21892, 2025

  21. [21]

    Sti: Turbocharge nlp inference at the edge via elastic pipelin- ing

    Liwei Guo, Wonkyo Choe, and Felix Xiaozhu Lin. Sti: Turbocharge nlp inference at the edge via elastic pipelin- ing. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 791– 803, 2023

  22. [22]

    FastMoE: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262,

    Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. Fastmoe: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262, 2021

  23. [23]

    Fastermoe: modeling and optimizing training of large-scale dy- namic pre-trained models

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dy- namic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022

  24. [24]

    Long document classification from local word glimpses via recurrent attention learning.IEEE Access, 7:40707– 40718, 2019

    Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu. Long document classification from local word glimpses via recurrent attention learning.IEEE Access, 7:40707– 40718, 2019

  25. [25]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  26. [26]

    Gpipe: Effi- cient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Effi- cient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

  27. [27]

    Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5:269–287, 2023

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5:269–287, 2023

  28. [28]

    Pre- gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference

    Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. Pre- gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1018–1031. IEEE, 2024

  29. [29]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  30. [30]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  31. [31]

    Lancet: Accelerating mixture- of-experts training via whole graph computation- communication overlapping.Proceedings of Machine Learning and Systems, 6:74–86, 2024

    Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture- of-experts training via whole graph computation- communication overlapping.Proceedings of Machine Learning and Systems, 6:74–86, 2024

  32. [32]

    Hydragen: High-throughput llm inference with shared prefixes

    Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydra- gen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024

  33. [33]

    Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

    Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

  34. [34]

    Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6710–6720, 2024

  35. [35]

    Reducing activation re- computation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation re- computation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

  36. [36]

    Efficient memory manage- ment for large language model serving with pagedatten- tion

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  37. [37]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, De- hao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling gi- ant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

  38. [38]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  39. [39]

    Accelerating distributed {MoE} training and inference with lina

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed {MoE} training and inference with lina. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, 2023

  40. [40]

    PyTorch Distributed: Experiences on Accelerating Data Parallel Training

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020

  41. [41]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

  42. [42]

    Cachegen: Kv cache compression and streaming for fast large lan- guage model serving

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large lan- guage model serving. InProceedings of the ACM SIG- COMM 2024 Conference, pages 38–56, 2024

  43. [43]

    Learning word vectors for sentiment analysis

    Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computa- tional linguistics: Human language technologies, pages 142–150, 2011

  44. [44]

    Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

  45. [45]

    Pipedream: Gen- eralized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Gen- eralized pipeline parallelism for dnn training. InPro- ceedings of the 27th ACM symposium on operating sys- tems principles, pages 1–15, 2019

  46. [46]

    Twitter Financial News Sentiment

    Neural Magic. Twitter Financial News Sentiment. https://huggingface.co/datasets/zeroshot/ twitter-financial-news-sentiment , 2022. Hug- ging Face dataset. Accessed: 2026-04-23

  47. [47]

    NVIDIA H100 Tensor Core GPU Archi- tecture Whitepaper

    NVIDIA. NVIDIA H100 Tensor Core GPU Archi- tecture Whitepaper. https://www.nvidia.com/en- us/data-center/h100/, 2026. Accessed: 2026-04- 23

  48. [48]

    TensorRT-LLM

    NVIDIA. TensorRT-LLM. https://github.com/ NVIDIA/TensorRT-LLM, 2026. GitHub repository. Ac- cessed: 2026-04-23

  49. [49]

    QuALITY: Question answering with long input texts, yes!

    Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...

  50. [50]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

  51. [51]

    Eps- moe: Expert pipeline scheduler for cost-efficient moe inference.arXiv preprint arXiv:2410.12247, 2024

    Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, and Xunliang Cai. Eps- moe: Expert pipeline scheduler for cost-efficient moe inference.arXiv preprint arXiv:2410.12247, 2024

  52. [52]

    Is ChatGPT a general-purpose natural language processing task solver?

    Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is ChatGPT a general-purpose natural language processing task solver? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1339–1384, 2023

  53. [53]

    Mooncake: A kvcache- centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache- centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

  54. [54]

    Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Min- jia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. InInternational confer- ence on machine learning, pages 18332–18346. PMLR, 2022

  55. [55]

    Zero: Memory optimizations toward train- ing trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  56. [56]

    Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

    Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high per- formance computing, networking, storage and analysis, pages 1–14, 2021

  57. [57]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  58. [58]

    Flexgen: High-throughput generative inference of large language models with a single gpu

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. InInternational Conference on Machine Learning, pages 31094–31116. PMLR, 2023

  59. [59]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  60. [60]

    ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models

    Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang, and Zhenan Fan. ElasticMoE: An efficient auto scaling method for mixture-of-experts models. arXiv preprint arXiv:2510.02613, 2025

  61. [61]

    Text classifi- cation via large language models

    Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classifi- cation via large language models. InFindings of the As- sociation for Computational Linguistics: EMNLP 2023, pages 8990–9005, 2023

  62. [62]

    The Toxicity Dataset

    Surge AI. The Toxicity Dataset. https://github. com/surge-ai/toxicity, 2022. GitHub repository. Accessed: 2026-04-23

  63. [63]

    Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures

    Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, and John Paul Shen. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. In2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 49–61. IEEE, 2025

  64. [64]

    Moe-infinity: Offloading-efficient moe model serving.arXiv e-prints, pages arXiv–2401, 2024

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe-infinity: Offloading-efficient moe model serving.arXiv e-prints, pages arXiv–2401, 2024

  65. [65]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  66. [66]

    Exploiting inter- layer expert affinity for accelerating mixture-of-experts model inference

    Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Sub- ramoni, and Dhabaleswar K DK Panda. Exploiting inter- layer expert affinity for accelerating mixture-of-experts model inference. In2024 IEEE International parallel and distributed processing symposium (IPDPS), pages 915–925. IEEE, 2024

  67. [67]

    Chunkatten- tion: Efficient self-attention with prefix-aware kv cache and two-phase partition

    Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkatten- tion: Efficient self-attention with prefix-aware kv cache and two-phase partition. InProceedings of the 62nd An- nual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 11608–11620, 2024

  68. [68]

    Orca: A distributed serving system for {Transformer-Based} generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soo- jeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating sys- tems design and implementation (OSDI 22), pages 521– 538, 2022

  69. [69]

    Recommendation as instruction following: A large language model empow- ered recommendation approach.ACM Transactions on Information Systems, 43(5):1–37, 2026

    Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empow- ered recommendation approach.ACM Transactions on Information Systems, 43(5):1–37, 2026

  70. [70]

    BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

    Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. BlendServe: Optimizing offline inference for auto-regressive large models with resource-aware batching. arXiv preprint arXiv:2411.16102, 2024

  71. [71]

    Sglang: Efficient execution of structured language model pro- grams.Advances in neural information processing sys- tems, 37:62557–62583, 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model pro- grams.Advances in neural information processing sys- tems, 37:62557–62583, 2024

  72. [72]

    BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

    Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching.arXiv preprint arXiv:2412.03594, 2024

  73. [73]

    {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

  74. [74]

    Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

  75. [75]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022