pith. machine review for the scientific record.

arxiv: 2605.00686 · v1 · submitted 2026-05-01 · 💻 cs.DC

Recognition: unknown

Eliminating Hidden Serialization in Multi-Node Megakernel Communication

Byungsoo Oh, Rachee Singh

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:32 UTC · model grok-4.3

classification 💻 cs.DC
keywords: megakernel · Mixture-of-Experts · RDMA · proxy transport · GPU communication · multi-node inference · serialization · Perseus

The pith

Hidden serialization from per-tile fences in proxy RDMA regresses multi-node megakernel MoE performance by up to 10x; Perseus removes it via decoupled signaling and NIC-side ordering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Megakernels for Mixture-of-Experts inference fuse expert computation with tile-granular GPU-initiated communication to overlap transfers with compute, outperforming collectives on a single node. On multi-node setups using RDMA fabrics, this overlap fails and models regress sharply with added nodes because proxy-based transports impose an ordering fence between each tile transfer and its completion signal, draining the NIC pipeline and exposing inflated latency when per-expert compute is small. Perseus eliminates the serialization with two changes: decoupled signaling, which batches fences at per-destination granularity to cut fence count by 8x, and NIC-side ordering, which substitutes hardware fence flags for proxy stalls so the CPU proxy never blocks. On proxy transports this yields up to 10.3x end-to-end speedup, while Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2x, demonstrating that serialization handling, not the choice of proxy versus GPU-direct transport, is what limits scaling.
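A back-of-envelope reading of that reduction, in notation of our own rather than the paper's: coupled signaling pays one fence per tile transfer, decoupled signaling one per batch.

```latex
% Notation ours: N concurrent tile transfers, batched per destination
% in groups of size g.
\[
  \text{fences}_{\text{coupled}} = N,
  \qquad
  \text{fences}_{\text{decoupled}} = \lceil N / g \rceil .
\]
% Figure 7's sweep matches this shape: 112 fences at g = 1 fall to 4 at g = 28.
```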

Core claim

The paper establishes that the strict ordering between each tile's data transfer and its completion signal in proxy RDMA forces a fence that serializes concurrent transfers, with a cost that grows with their number. Perseus counters this through decoupled signaling, which batches fences per destination, and NIC-side ordering, which moves enforcement to hardware fence flags on the NIC. As a result, communication-bound MoE models no longer place inflated network latency on the critical path, restoring the tile-granularity overlap benefit across nodes.
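A minimal sketch of the two signaling disciplines, written against the NVSHMEM device API. The paper's implementation lives inside a FlashMoE-style megakernel over a proxy transport, so treat the function names, buffer layout, and tile loop here as our assumptions, not the authors' code.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Coupled put+signal (vanilla): the transport must order each tile's data
// ahead of its own signal, which on a proxy transport becomes one
// pipeline-draining FENCE per tile.
__device__ void send_tiles_coupled(float *dest_buf, const float *tiles,
                                   size_t tile_bytes, uint64_t *sig,
                                   int n_tiles, int pe) {
  const size_t elems = tile_bytes / sizeof(float);
  for (int t = 0; t < n_tiles; ++t) {
    nvshmem_putmem_nbi(dest_buf + t * elems, tiles + t * elems,
                       tile_bytes, pe);
    nvshmem_fence();                                    // order data before signal
    nvshmemx_signal_op(sig, 1, NVSHMEM_SIGNAL_ADD, pe); // per-tile signal
  }
}

// Decoupled signaling (Perseus-style): issue every put destined for one
// peer, then pay a single fence and one batched signal for the group.
__device__ void send_tiles_decoupled(float *dest_buf, const float *tiles,
                                     size_t tile_bytes, uint64_t *sig,
                                     int n_tiles, int pe) {
  const size_t elems = tile_bytes / sizeof(float);
  for (int t = 0; t < n_tiles; ++t)
    nvshmem_putmem_nbi(dest_buf + t * elems, tiles + t * elems,
                       tile_bytes, pe);
  nvshmem_fence();                                      // one FENCE per destination
  nvshmemx_signal_op(sig, (uint64_t)n_tiles,            // receiver counts tiles
                     NVSHMEM_SIGNAL_ADD, pe);
}
```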

What carries the argument

Decoupled signaling (batching fences at per-destination granularity) combined with NIC-side ordering (replacing proxy stalls with hardware fence flags).
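For the NIC-side half, a hedged host-side sketch of what the proxy could post instead of stalling. The evaluation runs on Libfabric, where FI_FENCE is a real flag that defers an operation until prior operations to the same peer complete; everything else here (the helper, its parameters, the surrounding proxy loop) is assumed for illustration.

```cuda
// Hypothetical CPU-proxy sketch using the libfabric RMA API.
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

// Vanilla proxy: poll the completion queue until every earlier write to
// this peer finishes, then post the signal write. That stall is the
// hidden serialization. With NIC-side ordering, the proxy posts the
// signal write immediately with FI_FENCE and the NIC enforces ordering.
static ssize_t post_fenced_signal(struct fid_ep *ep, fi_addr_t peer,
                                  const struct iovec *sig_iov, void **desc,
                                  const struct fi_rma_iov *remote_sig,
                                  void *ctx) {
  struct fi_msg_rma msg = {0};
  msg.msg_iov = sig_iov;       // local signal payload
  msg.desc = desc;
  msg.iov_count = 1;
  msg.addr = peer;
  msg.rma_iov = remote_sig;    // remote signal location
  msg.rma_iov_count = 1;
  msg.context = ctx;
  // FI_FENCE: the NIC holds this write until all earlier operations to
  // `peer` complete; the proxy thread returns without draining anything.
  return fi_writemsg(ep, &msg, FI_FENCE);
}
```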

If this is right

  • Communication-bound MoE models regain the ability to hide tile transfers behind compute even when experts span multiple nodes.
  • Proxy-based RDMA transports become competitive with GPU-direct methods without requiring specialized hardware.
  • Performance no longer degrades with increasing node count for models whose per-expert compute cannot absorb extra latency.
  • The transport choice is no longer the primary limiter; software handling of ordering determines whether megakernel overlap succeeds at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fence-induced serialization may constrain other fine-grained GPU communication patterns that rely on proxy RDMA and per-operation signals.
  • NIC hardware support for fence flags could be extended to additional ordering primitives to benefit a wider set of distributed GPU workloads.
  • Applying the batching and hardware-flag techniques to non-MoE megakernels or other fused compute-communication kernels could improve multi-node scaling in additional domains.

Load-bearing premise

The assumption that hidden serialization from per-tile completion fences is the dominant, general cause of the observed regression across models and node counts, rather than interactions with specific network topologies or unmeasured overheads.

What would settle it

An experiment that varies the number of concurrent tile transfers on fixed hardware and tests whether end-to-end latency scales with fence count before Perseus but not after it would confirm or refute that serialization is the binding constraint.
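One possible shape for that experiment, with hypothetical helpers (launch_alltoall_kernel, fence_count) standing in for whatever the artifact exposes; the substance is the sweep and the correlation test, not this API.

```cuda
#include <cstdio>

// Hypothetical harness: sweep concurrent tile transfers on fixed hardware
// and record (fence count, latency) pairs for vanilla vs. Perseus builds.
// If latency tracks fence count before Perseus but not after, serialization
// was the binding constraint.
extern float launch_alltoall_kernel(int n_concurrent, bool perseus); // assumed
extern long  fence_count(int n_concurrent, bool perseus);            // assumed

int main() {
  printf("concurrency,fences_vanilla,ms_vanilla,fences_perseus,ms_perseus\n");
  for (int n = 1; n <= 128; n *= 2) {
    float t_v = launch_alltoall_kernel(n, /*perseus=*/false);
    float t_p = launch_alltoall_kernel(n, /*perseus=*/true);
    printf("%d,%ld,%.3f,%ld,%.3f\n", n,
           fence_count(n, false), t_v,
           fence_count(n, true),  t_p);
  }
  return 0;
}
```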

Figures

Figures reproduced from arXiv: 2605.00686 by Byungsoo Oh, Rachee Singh.

Figure 1
Figure 1: Top: Weak scaling (ideal = 1.0×). Llama4 uses 16 experts, limiting EP-only scaling to 4 nodes (16 GPUs). Bottom: GPU streaming multiprocessor (SM) traces at 4 nodes. Llama4 shows dense overlap; Qwen3 shows frequent stalls due to delayed signals. view at source ↗
Figure 2
Figure 2: Proxy–NIC interaction for 3 signaled transfers (p… view at source ↗
Figure 3
Figure 3: Megakernel execution model. Each GPU runs a persistent… view at source ↗
Figure 4
Figure 4: Proxy-based RDMA submission path. GPU threads en… view at source ↗
Figure 5
Figure 5: Signaling overhead across message sizes, node counts, and concurrency levels. (a) Signaling efficiency, defined as coupled put+signal… view at source ↗
Figure 6
Figure 6: CTA-to-proxy interaction for 96 remote experts across 12… view at source ↗
Figure 7
Figure 7: Effect of decoupled signaling (S=1K, 8 nodes). Left: FENCE count vs. group size. Right: latency vs. group size with coupled baseline (dashed red). Contribution to speedup decomposes into: ❶ submission reordering (coupled→decoupled at group size 1), ❷ FENCE count reduction from 112 (group size 1) to 4 (group size 28). view at source ↗
Figure 8
Figure 8: Combined effect of decoupled signaling and NIC-side… view at source ↗
Figure 9
Figure 9: End-to-end forward latency: vanilla FlashMoE vs. Perseus. (a)–(c) Perlmutter (A100, Slingshot-11, Libfabric), 2–16 nodes (8–64… view at source ↗
Figure 10
Figure 10: Left: Ablation of speedup over vanilla for Qwen3-30B on Perlmutter. At larger node counts, per-fence cost increases and fence… view at source ↗
Figure 11
Figure 11: Triton-distributed ALLTOALL microbenchmark (H=2048, Perlmutter). (a) Latency: vanilla is flat regardless of message size. (b) Overhead fraction (α/T): vanilla remains above 65%. (c) α comparison: Perseus achieves ∼99% reduction in serialization overhead. view at source ↗
Figure 12
Figure 12: Robustness to skewed expert routing (Qwen3-30B, 2–8… view at source ↗
Figure 14
Figure 14: Scalability recovery with Perseus. Top: microbenchmark… view at source ↗
Figure 15
Figure 15: Latency decomposition into overhead (α) and transfer cost (β) on Libfabric. Left: vanilla α grows with node count; Perseus stays flat. Right: Perseus also reduces β by 25–38%. (The paper decomposes communication latency with the α-β model [17]: the cost to send a message of size M is T = α + β×M, where α captures fixed overhead independent of message size and β the per-unit-size transfer cost.) view at source ↗
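The Figure 15 caption leans on the α-β model, T = α + β×M. A minimal two-point recovery of the parameters from timed transfers, with made-up measurement values:

```cuda
#include <cstdio>

// Given latencies T1, T2 (ms) measured at message sizes M1, M2 (MB),
// the alpha-beta model T = alpha + beta*M yields:
//   beta  = (T2 - T1) / (M2 - M1)   // per-MB transfer cost
//   alpha = T1 - beta * M1          // fixed overhead (the serialization term)
int main() {
  const double M1 = 1.0, T1 = 12.0;   // hypothetical measurements
  const double M2 = 8.0, T2 = 26.0;
  const double beta  = (T2 - T1) / (M2 - M1);
  const double alpha = T1 - beta * M1;
  printf("alpha = %.2f ms, beta = %.2f ms/MB\n", alpha, beta);
  return 0;
}
```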
read the original abstract

Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that multi-node megakernel communication for MoE inference suffers up to 10× regression due to hidden serialization in proxy-based RDMA transports, specifically from fence operations that drain the NIC pipeline as the number of concurrent transfers grows. Perseus addresses this with decoupled signaling, which batches fences at per-destination granularity and reduces their count by 8×, and NIC-side ordering, which uses hardware fence flags to avoid proxy blocking. The result is up to 10.3× end-to-end speedup on proxy-based transports, with Perseus on IBRC matching or exceeding IBGDA GPU-direct by up to 1.2×, indicating that serialization, rather than the transport choice itself, is the primary bottleneck.

Significance. Should the central claims hold after addressing the comments below, the work would be significant for the field of distributed systems and high-performance computing. It offers a practical solution to a performance issue in emerging megakernel designs for large-scale inference, with potential to improve efficiency of MoE models on RDMA-connected clusters. The explicit comparison between proxy and GPU-direct transports, along with the proposed optimizations, provides valuable insights into communication bottlenecks and could influence future designs of GPU-initiated communication primitives.

major comments (2)
  1. [Abstract] The claim that 'serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance' is load-bearing for the central thesis. It is supported by the reported 10.3× speedup and 1.2× match/exceed over IBGDA, but the manuscript does not provide ablations or controls that isolate fence-induced serialization from potential confounders such as network topology, concurrent transfer scaling, or megakernel-specific overheads that worsen with node count.
  2. [Abstract] The mechanistic tracing of the regression to per-tile ordering requirements forcing NIC-pipeline-draining fences is plausible, but lacks a quantitative microbenchmark or breakdown (e.g., in the evaluation) showing how fence cost scales with concurrent transfers and accounts for the full 10× regression across models. Without this, the generality of the diagnosis remains under-supported.
minor comments (1)
  1. The abstract would be clearer if it briefly noted the specific models, node counts, and hardware configurations used for the 10.3× and 1.2× results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback identifies opportunities to better isolate the contribution of serialization and to provide quantitative support for our mechanistic diagnosis. We address each major comment below and will revise the manuscript to incorporate additional evidence where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance' is load-bearing for the central thesis. It is supported by the reported 10.3× speedup and 1.2× match/exceed over IBGDA, but the manuscript does not provide ablations or controls that isolate fence-induced serialization from potential confounders such as network topology, concurrent transfer scaling, or megakernel-specific overheads that worsen with node count.

    Authors: We agree that additional controls would strengthen isolation of fence-induced serialization from other factors. The existing end-to-end results (10.3× on proxy transports and matching/exceeding IBGDA) already indicate that transport choice is not the primary limiter, but we will add targeted ablations in the revised evaluation. These will fix network topology while varying concurrent transfer count, and separately measure megakernel overheads at different node scales. This will more directly attribute the regression to serialization. revision: partial

  2. Referee: [Abstract] The mechanistic tracing of the regression to per-tile ordering requirements forcing NIC-pipeline-draining fences is plausible, but lacks a quantitative microbenchmark or breakdown (e.g., in the evaluation) showing how fence cost scales with concurrent transfers and accounts for the full 10× regression across models. Without this, the generality of the diagnosis remains under-supported.

    Authors: We concur that a dedicated microbenchmark would better quantify the scaling behavior. In the revised manuscript we will add a microbenchmark in the evaluation section that measures fence latency as a function of concurrent transfers. The new data will show the growth rate and demonstrate its contribution to the observed 10× regression across the evaluated MoE models, thereby supporting the generality of the diagnosis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on implementation and measurements

full rationale

The paper derives its central claim—that hidden serialization from fence drains bounds multi-node megakernel performance—through a concrete implementation of Perseus (decoupled signaling and NIC-side ordering) followed by direct end-to-end timing measurements on proxy transports and comparisons to IBGDA. No equations or parameters are defined in terms of the target result and then reused as predictions; no load-bearing uniqueness theorem or ansatz is imported via self-citation; the speedup numbers (10.3×, 1.2×) are reported outcomes rather than fitted inputs renamed as forecasts. The chain of evidence therefore rests on measurement against external baselines rather than on self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about RDMA proxy behavior and GPU kernel semantics but introduces no free parameters, new entities, or ad-hoc axioms beyond domain-standard hardware models.

axioms (1)
  • domain assumption Proxy-based RDMA transports enforce per-transfer ordering via fences that drain the NIC pipeline.
    This is the diagnosed mechanism; the paper treats it as an established property of current proxy implementations.

pith-pipeline@v0.9.0 · 5574 in / 1340 out tokens · 49975 ms · 2026-05-09T18:32:20.971553+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 16 canonical work pages · 6 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Mooncake EP & Mooncake Backend

    KVCache AI. Mooncake EP & Mooncake Backend. https://kvcache-ai.github.io/Mooncake/python-api-reference/ep-backend.html

  3. [3]

    FlashMoE: Fast Distributed MoE in a Single Kernel

    Osayamen Jonathan Aimuyo, Byungsoo Oh, and Rachee Singh. FlashMoE: Fast Distributed MoE in a Single Kernel. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  4. [4]

    ROCm DeepEP

    AMD. ROCm DeepEP. https://github.com/ROCm/DeepEP

  5. [5]

    ROCm OpenSHMEM (rocSHMEM)

    AMD. ROCm OpenSHMEM (rocSHMEM). https://github.com/ROCm/rocm-systems

  6. [6]

    Elastic Fabric Adapter

    AWS. Elastic Fabric Adapter. https://aws.amazon.com/hpc/efa/

  7. [7]

    Flux: Fast software-based communication overlap on GPUs through kernel fusion

    Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. Flux: Fast software-based communication overlap on GPUs through kernel fusion. arXiv preprint arXiv:2406.06858, 2024

  8. [8]

    Introducing openshmem: Shmem for the pgas community

    Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. Introducing OpenSHMEM: SHMEM for the PGAS community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pages 1–3, 2010

  9. [9]

    Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

    Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, et al. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs. arXiv preprint arXiv:2512.22219, 2025

  10. [10]

    NVIDIA Collective Communications Library (NCCL)

    NVIDIA Corporation. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

  11. [11]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems (NeurIPS 22), 35:16344–16359, 2022

  12. [12]

    An in-depth analysis of the slingshot interconnect

    Daniele De Sensi, Salvatore Di Girolamo, Kim H McMahon, Duncan Roweth, and Torsten Hoefler. An in-depth analysis of the Slingshot interconnect. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 20), pages 1–14, 2020

  13. [13]

    NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

    Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, et al. NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL. arXiv preprint arXiv:2603.13606, 2026

  14. [14]

    A brief introduction to the OpenFabrics Interfaces, a new network API for maximizing high performance application efficiency

    Paul Grun, Sean Hefty, Sayantan Sur, David Goodell, Robert D Russell, Howard Pritchard, and Jeffrey M Squyres. A brief introduction to the OpenFabrics Interfaces, a new network API for maximizing high performance application efficiency. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI 15), pages 34–39, 2015

  15. [15]

    GPU-Initiated Networking for NCCL

    Khaled Hamidouche, John Bachan, Pak Markthub, Peter-Jan Gootzen, Elena Agostini, Sylvain Jeaugey, Aamir Shafi, Georgios Theodorakis, and Manjunath Gorentla Venkata. GPU-Initiated Networking for NCCL. arXiv preprint arXiv:2511.15076, 2025

  16. [16]

    FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 22), pages 120–134, 2022

  17. [17]

    The communication challenge for MPP: Intel Paragon and Meiko CS-2

    Roger W Hockney. The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389–398, 1994

  18. [18]

    FlashOverlap: A lightweight design for efficiently overlapping communication and computation

    Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, and Yu Wang. FlashOverlap: A lightweight design for efficiently overlapping communication and computation. In Proceedings of the Twenty-First European Conference on Computer Systems (EuroSys '26), 2026

  19. [19]

    MSCCL++: Rethinking GPU communication abstractions for AI inference

    Changho Hwang, Peng Cheng, Roshan Dathathri, Abhinav Jangda, Saeed Maleki, Madan Musuvathi, Olli Saarikivi, Aashaka Shah, Ziyue Yang, Binyang Li, et al. MSCCL++: Rethinking GPU communication abstractions for AI inference. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (…

  20. [20]

    Tutel: Adaptive mixture-of-experts at scale

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems (MLSys 23), 5:269–287, 2023

  21. [21]

    Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping

    Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping. Proceedings of Machine Learning and Systems (MLSys 24), 6:74–86, 2024

  22. [22]

    MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

    Chao Jin, Ziheng Jiang, Zhihao Bai, Zheng Zhong, Juncai Liu, Xiang Li, Ningxin Zheng, Xi Wang, Cong Xie, Qi Huang, et al. MegaScale-MoE: Large-scale communication-efficient training of mixture-of-experts models in production. arXiv preprint arXiv:2505.11432, 2025

  23. [23]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 23), pages 611–626, 2023

  24. [24]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations (ICLR), 2021

  25. [25]

    Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

    Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan R Tallent, and Kevin J Barker. Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems, 31(1):94–110, 2019

  26. [26]

    Accelerating distributed MoE training and inference with Lina

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed MoE training and inference with Lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, 2023

  27. [27]

    fabric-lib: RDMA Point-to-Point Communication for LLM Systems

    Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen. RDMA Point-to-Point Communication for LLM Systems. arXiv preprint arXiv:2510.27656, 2025

  28. [28]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  29. [29]

    Janus: A unified distributed training framework for sparse mixture-of-experts models

    Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: A unified distributed training framework for sparse mixture-of-experts models. In Proceedings of the ACM SIGCOMM 2023 Conference, pages 486–498, 2023

  30. [30]

    ibv_wr_rdma_write_imm (IBV_WR API)

    Linux manual. ibv_wr_rdma_write_imm (IBV_WR API). https://man7.org/linux/man-pages/man3/ibv_wr_rdma_write_imm.3.html

  31. [31]

    UCCL-EP: Portable Expert-Parallel Communication

    Ziming Mao, Yihan Zhang, Chihan Cui, Zhen Huang, Kaichao You, Zhongjie Chen, Zhiying Xu, Zhenyu Gu, Scott Shenker, Costin Raiciu, et al. UCCL-EP: Portable Expert-Parallel Communication. arXiv preprint arXiv:2512.19849, 2025

  32. [32]

    Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

    Pak Markthub, Jim Dinan, Sreeram Potluri, and Seth Howell. Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/, 2022

  33. [33]

    Hugging Face model card: Llama-4-Scout-17B-16E

    Meta. Hugging Face model card: Llama-4-Scout-17B-16E. https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E, 2025

  34. [34]

    Perlmutter

    National Energy Research Scientific Computing Center. Perlmutter. https://docs.nersc.gov/systems/perlmutter/architecture/

  35. [35]

    ConnectX NICs

    NVIDIA. ConnectX NICs. https://www.nvidia.com/en-us/networking/ethernet-adapters/

  36. [36]

    GPUDirect RDMA

    NVIDIA. GPUDirect RDMA. https://docs.nvidia.com/cuda/gpudirect-rdma/, 2024

  37. [37]

    Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems

    NVIDIA. Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems. https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/, 2025

  38. [38]

    NVIDIA OpenSHMEM Library (NVSHMEM)

    NVIDIA Corporation. NVIDIA OpenSHMEM Library (NVSHMEM). https://github.com/NVIDIA/nvshmem

  39. [39]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. …

  40. [40]

    T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives

    Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D Sinclair. T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 1146–1164, 2024

  41. [41]

    Optimizing distributed ML communication with fused computation-collective operations

    Kishore Punniyamurthy, Khaled Hamidouche, and Bradford M Beckmann. Optimizing distributed ML communication with fused computation-collective operations. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC 24), pages 1–17, 2024

  42. [42]

    PyTorch Symmetric Memory

    PyTorch. PyTorch Symmetric Memory. https://docs.pytorch.org/docs/stable/symmetric_memory.html

  43. [43]

    Chimera: Communication fusion for hybrid parallelism in large language models

    Le Qin, Junwei Cui, Weilin Cai, and Jiayi Huang. Chimera: Communication fusion for hybrid parallelism in large language models. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA 25), pages 498–513, 2025

  44. [44]

    DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning (ICML), pages 18332–18346, 2022

  45. [45]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  46. [46]

    Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling

    Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. ScheMoE: An extensible mixture-of-experts distributed training system with tasks scheduling. In Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys 24), pages 236–249, 2024

  47. [47]

    Collective Communication for 100k+ GPUs

    Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, et al. Collective communication for 100k+ GPUs. arXiv preprint arXiv:2510.20171, 2025

  48. [48]

    Designing a Low-Latency Megakernel for Llama-1B

    Benjamin Spector, Jordan Juravsky, Stuart Sul, Owen Dugan, Dylan Lim, Dan Fu, Simran Arora, and Chris Ré. Designing a Low-Latency Megakernel for Llama-1B. https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles, 2025

  49. [49]

    ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

    Stuart H Sul, Simran Arora, Benjamin F Spector, and Christopher Ré. ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels. arXiv preprint arXiv:2511.13940, 2025

  50. [50]

    The Landscape of GPU-Centric Communication

    Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, and Ismayil Ismayilov. The landscape of GPU-centric communication. arXiv preprint arXiv:2409.09874, 2024

  51. [51]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  52. [52]

    Scalable training of mixture-of-experts models with Megatron Core

    Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, et al. Scalable training of mixture-of-experts models with Megatron Core. arXiv preprint arXiv:2603.07685, 2026

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  54. [54]

    Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel

    Fan Yu, Tong Liu, and Kai Sun. Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel. https://developer.nvidia.com/blog/optimizing-communication-for-mixture-of-experts-training-with-hybrid-expert-parallel/, 2026

  55. [55]

    Comet: Fine-grained computation-communication overlapping for mixture-of-experts

    Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, et al. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. Proceedings of Machine Learning and Systems (MLSys 25), 7, 2025

  56. [56]

    DeepEP: an efficient expert-parallel communication library

    Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. DeepEP: an efficient expert-parallel communication library. https://github.com/deepseek-ai/DeepEP, 2025

  57. [57]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems (NeurIPS 24), 37:62557–62583, 2024

  58. [58]

    Triton-distributed: Programming overlapping kernels on distributed AI systems with the Triton compiler

    Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, et al. Triton-distributed: Programming overlapping kernels on distributed AI systems with the Triton compiler. arXiv preprint arXiv:2504.19442, 2025

  59. [59]

    TileLink: Generating efficient compute-communication overlapping kernels using tile-centric primitives

    Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, et al. TileLink: Generating efficient compute-communication overlapping kernels using tile-centric primitives. Proceedings of Machine Learning and Systems (MLSys 25), 7, 2025

  60. [60]

    MegaScale-Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al. MegaScale-Infer: Efficient mixture-of-experts model serving with disaggregated expert parallelism. In Proceedings of the ACM SIGCOMM 2025 Conference, pages 592–608, 2025

  61. [61]

    Human behavior and the principle of least effort: An introduction to human ecology

    George Kingsley Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books, 2016