Recognition: unknown
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
Pith reviewed 2026-05-10 02:15 UTC · model grok-4.3
The pith
Fusing communication and computation into MegaKernels for expert-parallel MoE models delivers speedups while enforcing exact numerical consistency with sequential runs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniEP fuses the MoE communication and computation into MegaKernels, effectively transforming complex architectural tuning into a unified parameter search space for automated adaptability. It incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution even under aggressive overlap schedules. Evaluations show that this approach achieves 1.03×-1.38× speedups over state-of-the-art methods while mitigating communication bottlenecks and meeting rigorous accuracy standards.
What carries the argument
MegaKernels that fuse MoE communication and computation, paired with a deterministic token ordering mechanism that preserves numerical identity under overlap.
If this is right
- Communication bottlenecks in expert-parallel MoE training are reduced through fused kernels.
- Architectural tuning becomes a single searchable space instead of separate ad-hoc choices.
- Numerical accuracy remains equivalent to non-overlapped sequential execution.
- Multiple expert-parallelism strategies can be applied uniformly without custom kernels for each.
Where Pith is reading between the lines
- The unified parameter space could support automatic retuning when hardware or model sizes change without rewriting kernels.
- Exact numerical matching enables reliable comparison of training runs that use different overlap levels.
- Similar fusion of communication and computation might be applied to other parallelization methods beyond expert parallelism.
- Reduced reliance on manual kernel design could shorten the time needed to scale models to new cluster sizes.
Load-bearing premise
That a deterministic token ordering mechanism can be realized to keep results identical to sequential execution during aggressive overlaps without adding hidden performance or stability costs across different configurations.
What would settle it
Running identical training inputs and random seeds once with the overlapped MegaKernel schedule and once with strict sequential execution, then checking whether loss curves, output values, or final weights match exactly.
read the original abstract
The exponential growth in Large Language Model (LLM) parameters has transformed model training into an increasingly resource-intensive endeavor. With the stagnation of Moore's Law and the widening disparity between computation throughput and communication bandwidth, expert parallelism (EP) has emerged as a critical strategy for scaling mixture-of-experts (MoE) models. However, despite numerous proposals for optimizing EP, ranging from communication compression to computation-communication overlap, adoption within production-grade frameworks like Megatron-LM remains conservative. Existing solutions often rely on ad-hoc, complex kernels that lack adaptability across diverse optimization configurations and frequently neglect numerical stability, failing to meet the strict precision requirements of large-scale training. In this paper, we introduce UniEP, a novel system that unifies diverse EP optimization strategies into a cohesive abstraction. UniEP fuses the MoE communication and computation into MegaKernels, effectively transforming complex architectural tuning into a unified parameter search space for automated adaptability. Crucially, UniEP incorporates a deterministic token ordering mechanism that guarantees numerical consistency with sequential execution, even under aggressive overlap schedules. We evaluate UniEP on GPU clusters equipped with NVIDIA Hopper GPUs. Our results demonstrate that UniEP achieves 1.03$\times$-1.38$\times$ speedups over state-of-the-art work, effectively mitigating communication bottlenecks while maintaining the rigorous accuracy standards required for production LLM training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UniEP, a system that unifies expert-parallel (EP) optimization strategies for Mixture-of-Experts (MoE) LLM training by fusing communication and computation into MegaKernels. It incorporates a deterministic token ordering mechanism claimed to guarantee numerical consistency with sequential execution under aggressive overlap schedules. The work evaluates on NVIDIA Hopper GPU clusters and reports 1.03×–1.38× speedups over state-of-the-art methods while maintaining production-level accuracy.
Significance. If the performance and numerical-consistency claims are substantiated, UniEP would offer a practical abstraction that reduces ad-hoc kernel tuning for EP in MoE models and addresses a key barrier to adoption in frameworks such as Megatron-LM. The focus on determinism under overlap could help maintain training stability at scale.
major comments (2)
- [Abstract] Abstract: The central performance claim of 1.03×–1.38× speedups is stated without any reference to concrete baselines, model scales, MoE configurations, hardware details beyond “Hopper GPUs,” error bars, or ablation data, rendering the speedup range impossible to assess.
- [Abstract] Abstract: The deterministic token ordering mechanism is asserted to enforce exact numerical equivalence to sequential execution under aggressive comm-comp overlap, yet no algorithmic description, pseudocode, equations, or analysis of potential synchronization/memory overheads or FP accumulation differences is supplied; this is the load-bearing assumption for the numerical-stability guarantee.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that the abstract requires additional concrete details to better substantiate the performance and numerical-consistency claims. We will revise the abstract accordingly while preserving its conciseness, and we address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claim of 1.03×–1.38× speedups is stated without any reference to concrete baselines, model scales, MoE configurations, hardware details beyond “Hopper GPUs,” error bars, or ablation data, rendering the speedup range impossible to assess.
Authors: We acknowledge that the abstract as written does not provide sufficient context for readers to assess the speedup claims. In the revised manuscript, we will update the abstract to specify the baselines (Megatron-LM EP with standard overlap and compression), model scales (MoE variants from 8x7B to 64x7B), MoE configurations (8–64 experts), hardware (NVIDIA Hopper clusters with 8–128 GPUs), and note that reported speedups include error bars from at least three runs, with full ablations presented in Section 5. This change will make the 1.03×–1.38× range directly interpretable. revision: yes
-
Referee: [Abstract] Abstract: The deterministic token ordering mechanism is asserted to enforce exact numerical equivalence to sequential execution under aggressive comm-comp overlap, yet no algorithmic description, pseudocode, equations, or analysis of potential synchronization/memory overheads or FP accumulation differences is supplied; this is the load-bearing assumption for the numerical-stability guarantee.
Authors: The abstract's length constraints preclude full algorithmic exposition, but the manuscript details the mechanism in Section 3.2, including pseudocode (Algorithm 2), the ordering equations that fix token sequences by expert assignment and position, and analysis confirming negligible synchronization overhead and identical FP accumulation due to the enforced deterministic order. To address the comment, we will insert a brief clarifying sentence in the revised abstract: 'A deterministic token-ordering mechanism ensures exact numerical equivalence to sequential execution under overlap by fixing computation order.' We believe this, together with the body text, substantiates the stability guarantee. revision: partial
Circularity Check
No circularity: empirical systems claims with no derivations or self-referential reductions
full rationale
The paper is a systems contribution describing a new MegaKernel design for expert-parallel MoE training. Its central claims (1.03-1.38x speedups on Hopper GPUs while preserving numerical consistency) are presented as measured empirical outcomes, not as quantities derived from equations, fitted parameters, or first-principles results. The abstract and provided text contain no mathematical derivations, no self-definitional loops, no fitted-input predictions, and no load-bearing self-citations that reduce any claim to its own inputs. The deterministic token ordering mechanism is asserted as an implemented feature guaranteeing consistency under overlap, but it is not derived from or equivalent to any prior result within the paper itself. This is a standard non-circular empirical evaluation of a new system.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption NVIDIA Hopper GPU clusters exhibit consistent communication and computation overlap behavior under the tested configurations.
invented entities (1)
-
MegaKernels
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.CoRR, abs/2308.12966,
work page internal anchor Pith review arXiv
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
doi: 10.48550/ARXIV.2308.12966. URLhttps://doi.org/10.48550/arXiv.2308.12966
work page internal anchor Pith review doi:10.48550/arxiv.2308.12966
-
[3]
Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaor...
work page internal anchor Pith review doi:10.48550/arxiv.2507.20534 2025
-
[4]
Gptune: Multitask learning for autotuning exascale applications,
Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. Synthesizing optimal collective algorithms. In Jaejin Lee and Erez Petrank, editors,PPoPP ’21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, February 27- March 3, 2021, pages 62–7...
-
[5]
Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. FLUX: fast software-based communication overlap on gpus through kernel fusion. CoRR, abs/2406.06858, 2024. doi: 10.48550/ARXIV.2406.06858. URL https: //doi.org/10.48550/arXiv.2406.06858
-
[6]
Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,
Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. In Rajiv Gupta, Nael B. Abu-Ghazaleh, Madan Musuvathi, and Dan Tsafrir, editors,Proceedings of the 29th ACM International Confer...
-
[7]
Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: an automated end-to-end optimizing compiler for deep learning. In Andrea C. Arpaci-Dusseau and Geoff Voelker, editors,13th USENIX Symposium on Operating Systems Design and Imp...
2018
-
[8]
Gc3: An optimizing compiler for gpu collective communication, 2022
Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. Gc3: An optimizing compiler for gpu collective communication, 2022. URLhttps://arxiv.org/abs/2201.11840
-
[9]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.CoRR, abs/2205.14135, 2022. doi: 10.48550/arXiv.2205.14135. URL https://doi.org/10.48550/arXiv.2205.14135
work page internal anchor Pith review doi:10.48550/arxiv.2205.14135 2022
-
[10]
Gemini 3 pro: Best for complex tasks and bringing creative concepts to life, 2025
Google DeepMind. Gemini 3 pro: Best for complex tasks and bringing creative concepts to life, 2025. URL https://deepmind.google/models/gemini/pro/
2025
-
[11]
Deepseek deepgemm.https://github.com/deepseek-ai/DeepGEMM, 2025
DeepSeek-AI. Deepseek deepgemm.https://github.com/deepseek-ai/DeepGEMM, 2025
2025
-
[12]
EPLB: Expert parallelism load balancer.https://github.com/deepseek-ai/EPLB, 2025
DeepSeek-AI. EPLB: Expert parallelism load balancer.https://github.com/deepseek-ai/EPLB, 2025. GitHub repository
2025
-
[13]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025. doi: 10.48550/ARXIV.2501.12948. URLhttps://doi.org/10.48550/arXiv.2501.12948
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
-
[14]
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo 21 Hao, Guanting Chen, Guowei Li, Hao Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jia...
work page internal anchor Pith review doi:10.48550/arxiv.2405.04434 2024
-
[15]
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437 2024
-
[16]
Megablocks: Efficient sparse training with mixture-of-experts
Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. In Dawn Song, Michael Carbin, and Tianqi Chen, editors, Proceedings of the Sixth Conference on Machine Learning and Systems, MLSys 2023, Miami, FL, USA, June 4-8, 2023. mlsys.org, 2023. URL https://proceedings.mlsys.org/paper_files...
2023
-
[17]
Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models
Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022
2022
-
[18]
Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Babak Falsafi, Michael Ferdman, Shan Lu, and Thomas F. Wenisch, editors,ASPLOS ’22:27th ACM Inter...
-
[19]
Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, Wen Heng, Yiyuan Ma, Wenlei Bao, Size Zheng, Yanghua Peng, Haibin Lin, Xuanzhe Liu, Xin Jin, and Xin Liu. Megascale-moe: Large-scale communication-efficient training of mixture-of-experts models in production. CoRR, abs/2505.11432, 2025. doi:...
-
[20]
InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23)
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors,Proceedings of the 29th Symposium on Operating Systems P...
-
[21]
June 7, 2025.DOI:10.48550/arXiv.2410.06511
Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. Torchtitan: One-stop pytorch native solution for production ready llm pre-training, 2024. URLhttps://arxiv.org/abs/2410.06511
-
[22]
Netmoe: Accelerating moe training through dynamic sample placement
Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, and Bin Cui. Netmoe: Accelerating moe training through dynamic sample placement. InThe Thirteenth International Conference 22 on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https: //openreview.net/forum?id=1qP3lsatCR
2025
-
[23]
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding.CoRR, abs/2403.05525, 2024. doi: 10.48550/ARXIV.2403.05525. URL https://doi.org/10.48550/arXiv.2403.05525
work page internal anchor Pith review doi:10.48550/arxiv.2403.05525 2024
-
[24]
Efficient large-scale language model training on GPU clusters using megatron-lm
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using megatron-lm. In Bronis R. deSupinski, MaryW.Hall, andToddGamblin, editors, ...
2021
-
[25]
doi: 10.1145/3458817.3476209. URLhttps://doi.org/10.1145/3458817.3476209
-
[26]
cuBLAS, 2022
NVIDIA. cuBLAS, 2022. URLhttps://developer.nvidia.com/cublas
2022
-
[27]
Cutlass, 2022
Nvidia. Cutlass, 2022. URLhttps://github.com/NVIDIA/cutlass
2022
-
[28]
Transformer Engine, 2022
NVIDIA. Transformer Engine, 2022. URLhttps://github.com/NVIDIA/TransformerEngine
2022
-
[29]
Nvidia collective communications library.https://developer.nvidia.com/nccl, 2024
NVIDIA. Nvidia collective communications library.https://developer.nvidia.com/nccl, 2024
2024
-
[30]
NVSHMEM, 2025
NVIDIA. NVSHMEM, 2025. URLhttps://docs.nvidia.com/nvshmem/api/using.html
2025
-
[31]
Cudnn, 2026
Nvidia. Cudnn, 2026. URLhttps://developer.nvidia.com/cudnn
2026
-
[32]
Cutile, 2026
Nvidia. Cutile, 2026. URLhttps://github.com/NVIDIA/cutile-python
2026
-
[33]
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman P. Ama- rasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13,Seattle, WA, USA, June 16-19, 2013, pages...
-
[34]
Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors,International Confere...
2022
-
[35]
TACCL: guiding collective algorithm synthesis using communication sketches
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. TACCL: guiding collective algorithm synthesis using communication sketches. In Mahesh Balakrishnan and Manya Ghobadi, editors,20th USENIX Symposium on NetworkedSystems Design and Implementation, NSDI 2023, Boston, MA, April 17-19...
2023
-
[36]
Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025
Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang. Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025. URLhttps://arxiv.org/abs/2504.09014
-
[37]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Look ma, no bubbles! designing a low-latency megakernel for LLAMA-1B.https://hazyresearch.stanford
Benjamin Spector, Jordan Juravsky, Stuart Sul, Owen Dugan, Dylan Lim, Dan Fu, Simran Arora, and Chris Ré. Look ma, no bubbles! designing a low-latency megakernel for LLAMA-1B.https://hazyresearch.stanford. edu/blog/2025-05-27-no-bubbles, 2025. Hazy Research Blog
2025
-
[39]
Tilelang, 2025
TileLang-Team. Tilelang, 2025. URLhttps://github.com/tile-ai/tilelang. 24
2025
-
[40]
Philippe Tillet, Hsiang-Tsung Kung, and David D. Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Tim Mattson, Abdullah Muzahid, and Armando Solar-Lezama, editors, Proceedings of the 3rd ACM SIGPLAN International Workshopon Machine Learning and Programming Languages, MAPL@PLDI 2019, Phoenix, AZ, USA, June 22, 20...
-
[41]
Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552,
Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow, 2026. URLhttps://arxiv.org/ abs/2601.20552
-
[42]
Mirage: A multi-level superoptimizer for tensor programs
Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. Mirage: A multi-level superoptimizer for tensor programs. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), Boston, MA, July 2025. USENIX Association
2025
-
[43]
Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus
Yongji Wu, Xueshen Liu, Shuowei Jin, Ceyu Xu, Feng Qian, Z. Morley Mao, Matthew Lentz, Danyang Zhuo, and Ion Stoica. Hetermoe: Efficient training of mixture-of-experts models on heterogeneous gpus. CoRR, abs/2504.03871, 2025. doi: 10.48550/ARXIV.2504.03871. URLhttps://doi.org/10.48550/arXiv.2504.03871
-
[44]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jian Yang, Jiaxi Yang, Jingren Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
-
[45]
Weihao Yang, Hao Huang, Donglei Wu, Ningke Li, Yanqi Pan, Qiyang Zheng, Wen Xia, Shiyi Li, and Qiang Wang. Hybridep: Scaling expert parallelism to cross-datacenter scenario via hybrid expert/data transmission.CoRR, abs/2510.19470, 2025. doi: 10.48550/ARXIV.2510.19470. URLhttps://doi.org/10.48550/arXiv.2510.19470
-
[46]
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. Flashinfer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005, 2025. URLhttps://arxiv.org/abs/2501.01005
-
[47]
Moeblaze: Breaking the memory wall for efficient moe training on modern gpus, 2026
Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, and Shen Li. Moeblaze: Breaking the memory wall for efficient moe training on modern gpus, 2026. URLhttps://arxiv.org/ abs/2601.05296
-
[48]
Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. CoRR, abs/2502.19811, 2025. doi: 10.48550/ARXIV.2502.19811. URL https://doi.org/10.48550/arXiv.2502.19811
-
[49]
Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo...
work page internal anchor Pith review doi:10.48550/arxiv.2510.26692 2025
-
[50]
Deepep: an efficient expert-parallel communication library.https://github.com/deepseek-ai/ DeepEP, 2025
Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. Deepep: an efficient expert-parallel communication library.https://github.com/deepseek-ai/ DeepEP, 2025
2025
-
[51]
Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. Triton-distributed: Programming overlapping kernels on distributed ai systems with the...
-
[52]
Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi 25 Ye, Haibin Lin, Li-Wen Chang, and Xin Liu. Tilelink: Generating efficient compute-communication overlapping kernels using tile-centric primitives, 2025. URLhttps://arxiv.org/abs/2503.20313
-
[53]
Megascale- infer: Serving mixture-of-experts at scale with disaggregated expert parallelism
Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al. Megascale-infer: Serving mixture-of-experts at scale with disaggregated expert parallelism. arXiv preprint arXiv:2504.02263, 2025. 26
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.