pith. machine review for the scientific record.

arxiv: 2605.00686 · v1 · submitted 2026-05-01 · 💻 cs.DC

Recognition: unknown

Eliminating Hidden Serialization in Multi-Node Megakernel Communication

Byungsoo Oh, Rachee Singh

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:32 UTC · model grok-4.3

classification 💻 cs.DC
keywords: megakernel · Mixture-of-Experts · RDMA · proxy transport · GPU communication · multi-node inference · serialization · Perseus

The pith

Hidden serialization from per-tile fences in proxy RDMA regresses multi-node megakernel MoE performance by up to 10x; Perseus removes it via decoupled signaling and NIC-side ordering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Megakernels for Mixture-of-Experts inference fuse expert computation with tile-granular GPU-initiated communication to overlap transfers with compute, outperforming collectives on a single node. On multi-node setups using RDMA fabrics, this overlap fails and models regress sharply with added nodes because proxy-based transports impose an ordering fence between each tile transfer and its completion signal, draining the NIC pipeline and exposing inflated latency when per-expert compute is small. Perseus eliminates the serialization with two changes: decoupled signaling, which batches fences at per-destination granularity to cut fence count by 8x, and NIC-side ordering, which substitutes hardware fence flags for proxy stalls so the CPU proxy never blocks. On proxy transports this yields up to 10.3x end-to-end speedup, while Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2x, demonstrating that serialization handling, not the choice of proxy versus GPU-direct transport, is what limits scaling.
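A back-of-envelope reading of that reduction, in notation of our own rather than the paper's: coupled signaling pays one fence per tile transfer, decoupled signaling one per batch.

```latex
% Notation ours: N concurrent tile transfers, batched per destination
% in groups of size g.
\[
  \text{fences}_{\text{coupled}} = N,
  \qquad
  \text{fences}_{\text{decoupled}} = \lceil N / g \rceil .
\]
% Figure 7's sweep matches this shape: 112 fences at g = 1 fall to 4 at g = 28.
```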

Core claim

The paper establishes that the strict ordering between each tile's data transfer and its completion signal in proxy RDMA forces a fence that serializes concurrent transfers, with a cost that grows with their number. Perseus counters this through decoupled signaling, which batches fences per destination, and NIC-side ordering, which moves enforcement to hardware fence flags on the NIC. As a result, communication-bound MoE models no longer place inflated network latency on the critical path, restoring the tile-granularity overlap benefit across nodes.
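A minimal sketch of the two signaling disciplines, written against the NVSHMEM device API. The paper's implementation lives inside a FlashMoE-style megakernel over a proxy transport, so treat the function names, buffer layout, and tile loop here as our assumptions, not the authors' code.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Coupled put+signal (vanilla): the transport must order each tile's data
// ahead of its own signal, which on a proxy transport becomes one
// pipeline-draining FENCE per tile.
__device__ void send_tiles_coupled(float *dest_buf, const float *tiles,
                                   size_t tile_bytes, uint64_t *sig,
                                   int n_tiles, int pe) {
  const size_t elems = tile_bytes / sizeof(float);
  for (int t = 0; t < n_tiles; ++t) {
    nvshmem_putmem_nbi(dest_buf + t * elems, tiles + t * elems,
                       tile_bytes, pe);
    nvshmem_fence();                                    // order data before signal
    nvshmemx_signal_op(sig, 1, NVSHMEM_SIGNAL_ADD, pe); // per-tile signal
  }
}

// Decoupled signaling (Perseus-style): issue every put destined for one
// peer, then pay a single fence and one batched signal for the group.
__device__ void send_tiles_decoupled(float *dest_buf, const float *tiles,
                                     size_t tile_bytes, uint64_t *sig,
                                     int n_tiles, int pe) {
  const size_t elems = tile_bytes / sizeof(float);
  for (int t = 0; t < n_tiles; ++t)
    nvshmem_putmem_nbi(dest_buf + t * elems, tiles + t * elems,
                       tile_bytes, pe);
  nvshmem_fence();                                      // one FENCE per destination
  nvshmemx_signal_op(sig, (uint64_t)n_tiles,            // receiver counts tiles
                     NVSHMEM_SIGNAL_ADD, pe);
}
```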

What carries the argument

Decoupled signaling (batching fences at per-destination granularity) combined with NIC-side ordering (replacing proxy stalls with hardware fence flags).
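For the NIC-side half, a hedged host-side sketch of what the proxy could post instead of stalling. The evaluation runs on Libfabric, where FI_FENCE is a real flag that defers an operation until prior operations to the same peer complete; everything else here (the helper, its parameters, the surrounding proxy loop) is assumed for illustration.

```cuda
// Hypothetical CPU-proxy sketch using the libfabric RMA API.
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

// Vanilla proxy: poll the completion queue until every earlier write to
// this peer finishes, then post the signal write. That stall is the
// hidden serialization. With NIC-side ordering, the proxy posts the
// signal write immediately with FI_FENCE and the NIC enforces ordering.
static ssize_t post_fenced_signal(struct fid_ep *ep, fi_addr_t peer,
                                  const struct iovec *sig_iov, void **desc,
                                  const struct fi_rma_iov *remote_sig,
                                  void *ctx) {
  struct fi_msg_rma msg = {0};
  msg.msg_iov = sig_iov;       // local signal payload
  msg.desc = desc;
  msg.iov_count = 1;
  msg.addr = peer;
  msg.rma_iov = remote_sig;    // remote signal location
  msg.rma_iov_count = 1;
  msg.context = ctx;
  // FI_FENCE: the NIC holds this write until all earlier operations to
  // `peer` complete; the proxy thread returns without draining anything.
  return fi_writemsg(ep, &msg, FI_FENCE);
}
```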

If this is right

  • Communication-bound MoE models regain the ability to hide tile transfers behind compute even when experts span multiple nodes.
  • Proxy-based RDMA transports become competitive with GPU-direct methods without requiring specialized hardware.
  • Performance no longer degrades with increasing node count for models whose per-expert compute cannot absorb extra latency.
  • The transport choice is no longer the primary limiter; software handling of ordering determines whether megakernel overlap succeeds at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fence-induced serialization may constrain other fine-grained GPU communication patterns that rely on proxy RDMA and per-operation signals.
  • NIC hardware support for fence flags could be extended to additional ordering primitives to benefit a wider set of distributed GPU workloads.
  • Applying the batching and hardware-flag techniques to non-MoE megakernels or other fused compute-communication kernels could improve multi-node scaling in additional domains.

Load-bearing premise

The assumption that hidden serialization from per-tile completion fences is the dominant, general cause of the observed regression across models and node counts, rather than interactions with specific network topologies or unmeasured overheads.

What would settle it

An experiment that varies the number of concurrent tile transfers on fixed hardware and tests whether end-to-end latency scales with fence count before Perseus but not after it would confirm or refute that serialization is the binding constraint.
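One possible shape for that experiment, with hypothetical helpers (launch_alltoall_kernel, fence_count) standing in for whatever the artifact exposes; the substance is the sweep and the correlation test, not this API.

```cuda
#include <cstdio>

// Hypothetical harness: sweep concurrent tile transfers on fixed hardware
// and record (fence count, latency) pairs for vanilla vs. Perseus builds.
// If latency tracks fence count before Perseus but not after, serialization
// was the binding constraint.
extern float launch_alltoall_kernel(int n_concurrent, bool perseus); // assumed
extern long  fence_count(int n_concurrent, bool perseus);            // assumed

int main() {
  printf("concurrency,fences_vanilla,ms_vanilla,fences_perseus,ms_perseus\n");
  for (int n = 1; n <= 128; n *= 2) {
    float t_v = launch_alltoall_kernel(n, /*perseus=*/false);
    float t_p = launch_alltoall_kernel(n, /*perseus=*/true);
    printf("%d,%ld,%.3f,%ld,%.3f\n", n,
           fence_count(n, false), t_v,
           fence_count(n, true),  t_p);
  }
  return 0;
}
```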

Figures

Figures reproduced from arXiv: 2605.00686 by Byungsoo Oh, Rachee Singh.

Figure 1
Figure 1: Top: Weak scaling (ideal = 1.0×). Llama4 uses 16 experts, limiting EP-only scaling to 4 nodes (16 GPUs). Bottom: GPU streaming multiprocessor (SM) traces at 4 nodes. Llama4 shows dense overlap; Qwen3 shows frequent stalls due to delayed signals. view at source ↗
Figure 2
Figure 2: Proxy–NIC interaction for 3 signaled transfers (p… view at source ↗
Figure 3
Figure 3: Megakernel execution model. Each GPU runs a persistent… view at source ↗
Figure 4
Figure 4: Proxy-based RDMA submission path. GPU threads en… view at source ↗
Figure 5
Figure 5: Signaling overhead across message sizes, node counts, and concurrency levels. (a) Signaling efficiency, defined as coupled put+signal… view at source ↗
Figure 6
Figure 6: CTA-to-proxy interaction for 96 remote experts across 12… view at source ↗
Figure 7
Figure 7: Effect of decoupled signaling (S=1K, 8 nodes). Left: FENCE count vs. group size. Right: latency vs. group size with coupled baseline (dashed red). Contribution to speedup decomposes into: ❶ submission reordering (coupled→decoupled at group size 1), ❷ FENCE count reduction from 112 (group size 1) to 4 (group size 28). view at source ↗
Figure 8
Figure 8: Combined effect of decoupled signaling and NIC-side… view at source ↗
Figure 9
Figure 9: End-to-end forward latency: vanilla FlashMoE vs. Perseus. (a)–(c) Perlmutter (A100, Slingshot-11, Libfabric), 2–16 nodes (8–64… view at source ↗
Figure 10
Figure 10: Left: Ablation of speedup over vanilla for Qwen3-30B on Perlmutter. At larger node counts, per-fence cost increases and fence… view at source ↗
Figure 11
Figure 11: Triton-distributed ALLTOALL microbenchmark (H=2048, Perlmutter). (a) Latency: vanilla is flat regardless of message size. (b) Overhead fraction (α/T): vanilla remains above 65%. (c) α comparison: Perseus achieves ∼99% reduction in serialization overhead. view at source ↗
Figure 12
Figure 12: Robustness to skewed expert routing (Qwen3-30B, 2–8… view at source ↗
Figure 14
Figure 14: Scalability recovery with Perseus. Top: microbenchmark… view at source ↗
Figure 15
Figure 15: Latency decomposition into overhead (α) and transfer cost (β) on Libfabric. Left: vanilla α grows with node count; Perseus stays flat. Right: Perseus also reduces β by 25–38%. (The paper decomposes communication latency with the α-β model [17]: the cost to send a message of size M is T = α + β×M, where α captures fixed overhead independent of message size and β the per-unit-size transfer cost.) view at source ↗
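The Figure 15 caption leans on the α-β model, T = α + β×M. A minimal two-point recovery of the parameters from timed transfers, with made-up measurement values:

```cuda
#include <cstdio>

// Given latencies T1, T2 (ms) measured at message sizes M1, M2 (MB),
// the alpha-beta model T = alpha + beta*M yields:
//   beta  = (T2 - T1) / (M2 - M1)   // per-MB transfer cost
//   alpha = T1 - beta * M1          // fixed overhead (the serialization term)
int main() {
  const double M1 = 1.0, T1 = 12.0;   // hypothetical measurements
  const double M2 = 8.0, T2 = 26.0;
  const double beta  = (T2 - T1) / (M2 - M1);
  const double alpha = T1 - beta * M1;
  printf("alpha = %.2f ms, beta = %.2f ms/MB\n", alpha, beta);
  return 0;
}
```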
read the original abstract

Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that multi-node megakernel communication for MoE inference suffers up to 10× regression due to hidden serialization in proxy-based RDMA transports, specifically from fence operations that drain the NIC pipeline as the number of concurrent transfers grows. Perseus addresses this with decoupled signaling, which batches fences at per-destination granularity and reduces their count by 8×, and NIC-side ordering, which uses hardware fence flags to avoid proxy blocking. The result is up to 10.3× end-to-end speedup on proxy-based transports, with Perseus on IBRC matching or exceeding IBGDA GPU-direct by up to 1.2×, indicating that serialization, rather than the transport choice itself, is the primary bottleneck.

Significance. Should the central claims hold after addressing the comments below, the work would be significant for the field of distributed systems and high-performance computing. It offers a practical solution to a performance issue in emerging megakernel designs for large-scale inference, with potential to improve efficiency of MoE models on RDMA-connected clusters. The explicit comparison between proxy and GPU-direct transports, along with the proposed optimizations, provides valuable insights into communication bottlenecks and could influence future designs of GPU-initiated communication primitives.

major comments (2)
  1. [Abstract] The claim that 'serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance' is load-bearing for the central thesis. It is supported by the reported 10.3× speedup and 1.2× match/exceed over IBGDA, but the manuscript does not provide ablations or controls that isolate fence-induced serialization from potential confounders such as network topology, concurrent transfer scaling, or megakernel-specific overheads that worsen with node count.
  2. [Abstract] The mechanistic tracing of the regression to per-tile ordering requirements forcing NIC-pipeline-draining fences is plausible, but lacks a quantitative microbenchmark or breakdown (e.g., in the evaluation) showing how fence cost scales with concurrent transfers and accounts for the full 10× regression across models. Without this, the generality of the diagnosis remains under-supported.
minor comments (1)
  1. The abstract would be clearer if it briefly noted the specific models, node counts, and hardware configurations used for the 10.3× and 1.2× results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback identifies opportunities to better isolate the contribution of serialization and to provide quantitative support for our mechanistic diagnosis. We address each major comment below and will revise the manuscript to incorporate additional evidence where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance' is load-bearing for the central thesis. It is supported by the reported 10.3× speedup and 1.2× match/exceed over IBGDA, but the manuscript does not provide ablations or controls that isolate fence-induced serialization from potential confounders such as network topology, concurrent transfer scaling, or megakernel-specific overheads that worsen with node count.

    Authors: We agree that additional controls would strengthen isolation of fence-induced serialization from other factors. The existing end-to-end results (10.3× on proxy transports and matching/exceeding IBGDA) already indicate that transport choice is not the primary limiter, but we will add targeted ablations in the revised evaluation. These will fix network topology while varying concurrent transfer count, and separately measure megakernel overheads at different node scales. This will more directly attribute the regression to serialization. revision: partial

  2. Referee: [Abstract] The mechanistic tracing of the regression to per-tile ordering requirements forcing NIC-pipeline-draining fences is plausible, but lacks a quantitative microbenchmark or breakdown (e.g., in the evaluation) showing how fence cost scales with concurrent transfers and accounts for the full 10× regression across models. Without this, the generality of the diagnosis remains under-supported.

    Authors: We concur that a dedicated microbenchmark would better quantify the scaling behavior. In the revised manuscript we will add a microbenchmark in the evaluation section that measures fence latency as a function of concurrent transfers. The new data will show the growth rate and demonstrate its contribution to the observed 10× regression across the evaluated MoE models, thereby supporting the generality of the diagnosis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on implementation and measurements

full rationale

The paper derives its central claim—that hidden serialization from fence drains bounds multi-node megakernel performance—through a concrete implementation of Perseus (decoupled signaling and NIC-side ordering) followed by direct end-to-end timing measurements on proxy transports and comparisons to IBGDA. No equations or parameters are defined in terms of the target result and then reused as predictions; no load-bearing uniqueness theorem or ansatz is imported via self-citation; the speedup numbers (10.3×, 1.2×) are reported outcomes rather than fitted inputs renamed as forecasts. The chain of evidence therefore rests on measurement against external baselines rather than on self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about RDMA proxy behavior and GPU kernel semantics but introduces no free parameters, new entities, or ad-hoc axioms beyond domain-standard hardware models.

axioms (1)
  • domain assumption Proxy-based RDMA transports enforce per-transfer ordering via fences that drain the NIC pipeline.
    This is the diagnosed mechanism; the paper treats it as an established property of current proxy implementations.

pith-pipeline@v0.9.0 · 5574 in / 1340 out tokens · 49975 ms · 2026-05-09T18:32:20.971553+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 16 canonical work pages · 6 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Mooncake EP & Mooncake Backend

    KVCache AI. Mooncake EP & Mooncake Backend. https://kvcache-ai.github.io/Mooncake/python-api-reference/ep-backend.html

  3. [3]

    FlashMoE: Fast Distributed MoE in a Single Kernel

    Osayamen Jonathan Aimuyo, Byungsoo Oh, and Rachee Singh. FlashMoE: Fast Distributed MoE in a Single Kernel. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  4. [4]

    ROCm DeepEP

    AMD. ROCm DeepEP. https://github.com/ROCm/DeepEP

  5. [5]

    ROCm OpenSHMEM (rocSHMEM)

    AMD. ROCm OpenSHMEM (rocSHMEM). https://github.com/ROCm/rocm-systems

  6. [6]

    Elastic Fabric Adapter

    AWS. Elastic Fabric Adapter. https://aws.amazon.com/hpc/efa/

  7. [7]

    Flux: Fast software-based communication overlap on GPUs through kernel fusion

    Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. Flux: Fast software-based communication overlap on GPUs through kernel fusion. arXiv preprint arXiv:2406.06858, 2024

  8. [8]

    Introducing openshmem: Shmem for the pgas community

    Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. Introducing OpenSHMEM: SHMEM for the PGAS community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pages 1–3, 2010

  9. [9]

    Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

    Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, et al. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs. arXiv preprint arXiv:2512.22219, 2025

  10. [10]

    NVIDIA Collective Communications Library (NCCL)

    NVIDIA Corporation. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

  11. [11]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems (NeurIPS 22), 35:16344–16359, 2022

  12. [12]

    An in-depth analysis of the slingshot interconnect

    Daniele De Sensi, Salvatore Di Girolamo, Kim H McMahon, Duncan Roweth, and Torsten Hoefler. An in-depth analysis of the Slingshot interconnect. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 20), pages 1–14, 2020

  13. [13]

    NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

    Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, et al. NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL. arXiv preprint arXiv:2603.13606, 2026

  14. [14]

    A brief introduction to the OpenFabrics Interfaces, a new network API for maximizing high performance application efficiency

    Paul Grun, Sean Hefty, Sayantan Sur, David Goodell, Robert D Russell, Howard Pritchard, and Jeffrey M Squyres. A brief introduction to the OpenFabrics Interfaces, a new network API for maximizing high performance application efficiency. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI 15), pages 34–39, 2015

  15. [15]

    GPU-Initiated Networking for NCCL

    Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, and Manjunath Gorentla Venkata. GPU-Initiated Networking for NCCL. arXiv preprint arXiv:2511.15076, 2025

  16. [16]

    FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 22), pages 120–134, 2022

  17. [17]

    The communication challenge for MPP: Intel Paragon and Meiko CS-2

    Roger W Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389–398, 1994

  18. [18]

    FlashOverlap: A lightweight design for efficiently overlapping communication and computation

    Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, and Yu Wang. FlashOverlap: A lightweight design for efficiently overlapping communication and computation. In Proceedings of the Twenty-First European Conference on Computer Systems (EuroSys '26), 2026

  19. [19]

    MSCCL++: Rethinking GPU communication abstractions for AI inference

    Changho Hwang, Peng Cheng, Roshan Dathathri, Abhinav Jangda, Saeed Maleki, Madan Musuvathi, Olli Saarikivi, Aashaka Shah, Ziyue Yang, Binyang Li, et al. MSCCL++: Rethinking GPU communication abstractions for AI inference. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (…

  20. [20]

    Tutel: Adaptive mixture-of-experts at scale

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems (MLSys 23), 5:269–287, 2023

  21. [21]

    Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping

    Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping. Proceedings of Machine Learning and Systems (MLSys 24), 6:74–86, 2024

  22. [22]

    MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

    Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, et al. MegaScale-MoE: Large-scale communication-efficient training of mixture-of-experts models in production. arXiv preprint arXiv:2505.11432, 2025

  23. [23]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 23), pages 611–626, 2023

  24. [24]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations (ICLR), 2021

  25. [25]

    Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

    Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R Tallent, and Kevin J Barker. Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems, 31(1):94–110, 2019

  26. [26]

    Accelerating distributed MoE training and inference with Lina

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed MoE training and inference with Lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, 2023

  27. [27]

    fabric-lib: RDMA Point-to-Point Communication for LLM Systems

    Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen. RDMA Point-to-Point Communication for LLM Systems. arXiv preprint arXiv:2510.27656, 2025

  28. [28]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  29. [29]

    Janus: A unified distributed training framework for sparse mixture-of-experts models

    Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: A unified distributed training framework for sparse mixture-of-experts models. In Proceedings of the ACM SIGCOMM 2023 Conference, pages 486–498, 2023

  30. [30]

    ibv_wr_rdma_write_imm (IBV_WR API)

    Linux manual. ibv_wr_rdma_write_imm (IBV_WR API). https://man7.org/linux/man-pages/man3/ibv_wr_rdma_write_imm.3.html

  31. [31]

    UCCL-EP: Portable Expert-Parallel Communication

    Ziming Mao, Yihan Zhang, Chihan Cui, Zhen Huang, Kaichao You, Zhongjie Chen, Zhiying Xu, Zhenyu Gu, Scott Shenker, Costin Raiciu, et al. UCCL-EP: Portable Expert-Parallel Communication. arXiv preprint arXiv:2512.19849, 2025

  32. [32]

    Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

    Pak Markthub, Jim Dinan, Sreeram Potluri, and Seth Howell. Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/, 2022

  33. [33]

    Hugging Face model card: Llama-4-Scout-17B-16E

    Meta. Hugging Face model card: Llama-4-Scout-17B-16E. https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E, 2025

  34. [34]

    Perlmutter

    National Energy Research Scientific Computing Center. Perlmutter. https://docs.nersc.gov/systems/perlmutter/architecture/

  35. [35]

    ConnectX NICs

    NVIDIA. ConnectX NICs. https://www.nvidia.com/en-us/networking/ethernet-adapters/

  36. [36]

    GPUDirect RDMA

    NVIDIA. GPUDirect RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/, 2024

  37. [37]

    Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

    NVIDIA. Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems. https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/, 2025

  38. [38]

    NVIDIA OpenSHMEM Library (NVSHMEM)

    NVIDIA Corporation. NVIDIA OpenSHMEM Library (NVSHMEM). https://github.com/NVIDIA/nvshmem

  39. [39]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. …

  40. [40]

    T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D Sinclair. T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1146–1164, 2024

  41. [41]

    Optimizing distributed ML communication with fused computation-collective operations

    Kishore Punniyamurthy, Khaled Hamidouche, and Bradford M Beckmann. Optimizing distributed ML communication with fused computation-collective operations. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 24), pages 1–17, 2024

  42. [42]

    PyTorch Symmetric Memory

    PyTorch. PyTorch Symmetric Memory. https://docs.pytorch.org/docs/stable/symmetric_memory.html

  43. [43]

    Chimera: Communication fusion for hybrid parallelism in large language models

    Le Qin, Junwei Cui, Weilin Cai, and Jiayi Huang. Chimera: Communication fusion for hybrid parallelism in large language models. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA 25), pages 498–513, 2025

  44. [44]

    DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning (ICML), pages 18332–18346, 2022

  45. [45]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  46. [46]

    Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling

    Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. ScheMoE: An extensible mixture-of-experts distributed training system with tasks scheduling. In Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys 24), pages 236–249, 2024

  47. [47]

    Collective Communication for 100k+ GPUs

    Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, et al. Collective communication for 100k+ GPUs. arXiv preprint arXiv:2510.20171, 2025

  48. [48]

    Designing a Low-Latency Megakernel for Llama-1B

    Benjamin Spector, Jordan Juravsky, Stuart Sul, Owen Dugan, Dylan Lim, Dan Fu, Simran Arora, and Chris Ré. Designing a Low-Latency Megakernel for Llama-1B. https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles, 2025

  49. [49]

    ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

    Stuart H Sul, Simran Arora, Benjamin F Spector, and Christopher Ré. ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels. arXiv preprint arXiv:2511.13940, 2025

  50. [50]

    The Landscape of GPU-Centric Communication

    Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, and Ismayil Ismayilov. The landscape of GPU-centric communication. arXiv preprint arXiv:2409.09874, 2024

  51. [51]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  52. [52]

    Scalable training of mixture-of-experts models with Megatron Core

    Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, et al. Scalable training of mixture-of-experts models with Megatron Core. arXiv preprint arXiv:2603.07685, 2026

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  54. [54]

    Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

    Fan Yu, Tong Liu, and Kai Sun. Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel. https://developer.nvidia.com/blog/optimizing-communication-for-mixture-of-experts-training-with-hybrid-expert-parallel/, 2026

  55. [55]

    Comet: Fine-grained computation-communication overlapping for mixture-of-experts

    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, et al. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. Proceedings of Machine Learning and Systems (MLSys 25), 7, 2025

  56. [56]

    DeepEP: an efficient expert-parallel communication library

    Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. DeepEP: an efficient expert-parallel communication library. https://github.com/deepseek-ai/DeepEP, 2025

  57. [57]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems (NeurIPS 24), 37:62557–62583, 2024

  58. [58]

    Triton-distributed: Programming overlapping kernels on distributed AI systems with the Triton compiler

    Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, et al. Triton-distributed: Programming overlapping kernels on distributed AI systems with the Triton compiler. arXiv preprint arXiv:2504.19442, 2025

  59. [59]

    TileLink: Generating efficient compute-communication overlapping kernels using tile-centric primitives

    Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, et al. TileLink: Generating efficient compute-communication overlapping kernels using tile-centric primitives. Proceedings of Machine Learning and Systems (MLSys 25), 7, 2025

  60. [60]

    MegaScale-Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al. MegaScale-Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism. In Proceedings of the ACM SIGCOMM 2025 Conference, pages 592–608, 2025

  61. [61]

    Human behavior and the principle of least effort: An introduction to human ecology

    George Kingsley Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books, 2016