pith. machine review for the scientific record.

arxiv: 2604.19654 · v1 · submitted 2026-04-21 · 💻 cs.DC

Recognition: unknown

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:11 UTC · model grok-4.3

classification 💻 cs.DC
keywords: Mixture of Experts · load balancing · distributed training · Copy Engine · expert parallelism · NVLink · Hopper GPU

The pith

The NVLink Copy Engine provides a nearly free channel for intra-node MoE load balancing, moving data without consuming streaming multiprocessor (SM) cycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the dedicated Copy Engine on Hopper GPUs transfers tokens and weights between intra-node GPUs in parallel with compute kernels and without consuming streaming multiprocessor cycles. FEPLB applies this by first routing tokens across nodes with standard expert parallelism, then using the Copy Engine for dynamic redistribution within the NVLink domain while a CPU scheduler runs concurrently. This approach avoids the communication overhead of earlier dynamic balancing methods and integrates directly with existing pipeline and expert parallelism. On a 128-expert MoE model trained without auxiliary loss, the method cuts the token straggler by 51-70% and the GEMM straggler by 50-68% across up to 16 GPUs, and the gains grow as the expert parallelism degree rises.
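
To make the mechanism concrete, here is a minimal PyTorch sketch (not FEPLB's code) of the overlap the paper relies on: a peer-to-peer copy issued on a side stream is serviced by a Copy Engine, so a GEMM running on the default stream keeps the SMs busy at the same time. It assumes two NVLink-connected GPUs with P2P enabled; all sizes and buffer names are illustrative.

```python
import torch

# Assumed setup: cuda:0 and cuda:1 share an NVLink domain with P2P enabled.
a = torch.randn(4096, 4096, device="cuda:0")
tokens = torch.randn(1 << 16, 1024, device="cuda:0")  # hypothetical token buffer
dst = torch.empty_like(tokens, device="cuda:1")

copy_stream = torch.cuda.Stream(device="cuda:0")

# Issue the peer-to-peer copy on a side stream: with P2P enabled it is
# serviced by a Copy Engine (DMA over NVLink), not by an SM copy kernel.
with torch.cuda.stream(copy_stream):
    dst.copy_(tokens, non_blocking=True)

# Compute proceeds concurrently on the default stream while the copy runs.
for _ in range(10):
    _ = a @ a

torch.cuda.synchronize()  # join both streams before reading `dst`
```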

Core claim

FEPLB uses a Two-Phase Dispatch: tokens first cross nodes via the standard EP backend, then dynamic-expert tokens and weights are redistributed inside the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler overlaps with static expert computation. This yields a 51-70% lower token straggler and a 50-68% lower GEMM straggler with no measurable EP communication overhead.

What carries the argument

Two-Phase Dispatch, which routes tokens across nodes with the EP backend and then redistributes them within the NVLink domain via the Copy Engine, using resources orthogonal to those consumed by EP and PP.
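
A schematic sketch of that two-phase flow, assuming a PyTorch-style stack; the function, its arguments, and the `rebalance_plan` structure are hypothetical stand-ins, since the abstract does not specify FEPLB's interfaces.

```python
import torch
import torch.distributed as dist

def two_phase_dispatch(tokens, routed, in_splits, out_splits,
                       rebalance_plan, copy_stream):
    # Phase 1: inter-node token routing over the standard EP backend
    # (the usual expert-parallel all-to-all; a backend such as DeepEP
    # would sit here). Assumes the process group is already initialized.
    dist.all_to_all_single(routed, tokens,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits)

    # Phase 2: intra-node rebalancing. `rebalance_plan` stands in for the
    # CPU scheduler's output: (src, dst) buffer pairs of dynamic-expert
    # tokens/weights to shift to less loaded GPUs in the NVLink domain.
    # Issued on a side stream, the copies are serviced by the Copy Engine
    # and overlap with the static experts' GEMMs on the compute stream.
    with torch.cuda.stream(copy_stream):
        for src, dst in rebalance_plan:
            dst.copy_(src, non_blocking=True)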

Load-bearing premise

The NVLink Copy Engine can move data between intra-node GPUs without consuming any streaming multiprocessor cycles and can run in parallel with compute kernels while the CPU scheduler executes concurrently.

What would settle it

Measure actual SM utilization on Hopper GPUs during Copy Engine transfers on the same MoE workload, or run the identical training on GPUs lacking a dedicated Copy Engine and check whether the reported straggler reductions disappear.
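
The first experiment can be approximated with a short timing harness. This is a minimal sketch, assuming two NVLink-connected GPUs with P2P enabled and PyTorch; if the Copy Engine premise holds, the two reported timings should be nearly identical, and a gap would indicate SM or bandwidth contention. The second experiment is hardware-dependent and not sketched here.

```python
import torch

def timed_compute(with_copy: bool) -> float:
    # Fixed SM-bound workload on cuda:0, optionally with a concurrent
    # peer-to-peer copy to cuda:1 issued on a side stream.
    a = torch.randn(8192, 8192, device="cuda:0")
    src = torch.randn(1 << 26, device="cuda:0")  # ~256 MB payload
    dst = torch.empty(1 << 26, device="cuda:1")
    side = torch.cuda.Stream(device="cuda:0")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()
    start.record()
    if with_copy:
        with torch.cuda.stream(side):
            dst.copy_(src, non_blocking=True)  # Copy Engine path
    for _ in range(50):
        _ = a @ a                              # SM-bound GEMMs
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)             # milliseconds

print("compute only:      %.1f ms" % timed_compute(False))
print("compute + CE copy: %.1f ms" % timed_compute(True))
```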

Figures

Figures reproduced from arXiv: 2604.19654 by Haoyuan Liu, Shizhen Zhao, Shuyao Qi.

Figure 1: (a) Per-GPU token distribution (stacked) across 7,000 training iterations for GLM-5’s MoE layer (128 experts, …)
Figure 3: FEPLB system architecture. The training loop …
Figure 4: EP communication time (Dispatch and Com…)
Figure 5: Token straggler (top) and GEMM straggler …
Figure 6: Token straggler as a function of the dynamic …
Original abstract

Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide, especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel that runs in parallel with compute kernels. FEPLB turns this idle hardware into a new parallel dimension for MoE load rebalancing. Its Two-Phase Dispatch first routes tokens across nodes via the standard EP backend, then redistributes dynamic-expert tokens and weights within the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler runs concurrently with static expert computation. Because FEPLB uses only Copy Engine and CPU resources that are orthogonal to those consumed by EP and PP, it coexists with existing parallel strategies without reconfiguration. On GLM-5's MoE layers (128 experts, no auxiliary loss, up to 16 H100 GPUs), FEPLB reduces the token straggler by 51-70% and the GEMM straggler by 50-68% with no measurable EP communication overhead. Its advantage grows with the EP degree: at EP=8, it achieves 2x lower token straggler than FasterMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FEPLB, a dynamic load-balancing method for MoE training that performs standard expert parallelism (EP) across nodes followed by intra-node redistribution of tokens and expert weights via the NVIDIA Hopper NVLink Copy Engine. A concurrent lightweight CPU scheduler operates alongside static expert computation. The central empirical claim is that this yields 51-70% reduction in token straggler and 50-68% reduction in GEMM straggler on GLM-5 MoE layers (128 experts, no auxiliary loss) across up to 16 H100 GPUs, with no measurable increase in EP communication overhead and a 2x improvement over FasterMoE at EP=8.

Significance. If the orthogonality of Copy Engine transfers to SM compute, EP communication, and CPU scheduling is validated, the technique offers a practical, low-overhead way to achieve fine-grained per-micro-batch balancing in large-scale MoE training without altering existing EP/PP configurations. The approach exploits underutilized hardware resources and could scale favorably with EP degree.

major comments (2)
  1. [Abstract] The performance claims (51-70% token straggler reduction, 50-68% GEMM straggler reduction, 2x improvement over FasterMoE at EP=8) are stated without any reference to the measurement protocol for stragglers, the exact definition of baselines, the number of experimental runs, error bars, or the profiling methodology used to confirm zero EP overhead and full overlap. These details are load-bearing for the central empirical claim.
  2. [Two-Phase Dispatch] The assertion that NVLink Copy Engine DMA transfers incur literally zero SM cycles, produce no NVLink/memory-controller contention with concurrent GEMM or EP kernels, and remain perfectly overlapped even while the CPU scheduler is active is the key assumption enabling the 'nearly free' result. No concrete profiling data (SM occupancy, bandwidth traces, or contention measurements) under the GLM-5 workload (128 experts, no aux loss) is referenced to substantiate full orthogonality.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief sentence clarifying the precise scope of 'no measurable EP communication overhead' (e.g., whether it includes any secondary effects on NVLink bandwidth).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our empirical claims and the need for explicit validation of hardware orthogonality. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The performance claims (51-70% token straggler reduction, 50-68% GEMM straggler reduction, 2x improvement over FasterMoE at EP=8) are stated without any reference to the measurement protocol for stragglers, the exact definition of baselines, the number of experimental runs, error bars, or the profiling methodology used to confirm zero EP overhead and full overlap. These details are load-bearing for the central empirical claim.

    Authors: We agree that the abstract would benefit from brief methodological anchors to support the reported numbers. In the revision we will add a short parenthetical reference directing readers to the evaluation section, where stragglers are defined as (max − mean) expert latency per micro-batch (a minimal sketch of this metric appears after these responses), baselines are vanilla EP, results are averaged over five runs (std < 5%), and overlap is confirmed via Nsight Systems traces showing no measurable EP time increase. Full experimental protocol remains in Section 5. revision: yes

  2. Referee: [Two-Phase Dispatch] The assertion that NVLink Copy Engine DMA transfers incur literally zero SM cycles, produce no NVLink/memory-controller contention with concurrent GEMM or EP kernels, and remain perfectly overlapped even while the CPU scheduler is active is the key assumption enabling the 'nearly free' result. No concrete profiling data (SM occupancy, bandwidth traces, or contention measurements) under the GLM-5 workload (128 experts, no aux loss) is referenced to substantiate full orthogonality.

    Authors: The referee correctly identifies the central hardware assumption. The manuscript currently relies on NVIDIA Hopper documentation and prior microbenchmarks for Copy-Engine independence; end-to-end runs show unchanged EP communication time. To provide workload-specific evidence, the revised manuscript will add a dedicated profiling subsection containing SM occupancy traces, NVLink bandwidth utilization, and contention measurements collected under the exact GLM-5 (128-expert, no-aux-loss) configuration while Copy Engine, GEMM, and CPU scheduler execute concurrently. These data will directly substantiate the claimed orthogonality. revision: yes
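
For concreteness, here is a minimal sketch of the straggler metric as the rebuttal defines it: (max − mean) per-expert latency within each micro-batch, averaged over micro-batches. The tensor layout and names are assumptions for illustration, not the paper's code.

```python
import torch

def straggler(latencies_ms: torch.Tensor) -> float:
    # latencies_ms: [num_microbatches, num_experts], hypothetical
    # per-expert GEMM (or token-processing) times in milliseconds.
    per_mb = latencies_ms.max(dim=1).values - latencies_ms.mean(dim=1)
    return per_mb.mean().item()

# e.g., 100 micro-batches x 128 experts of synthetic timings
print(straggler(torch.rand(100, 128) * 5.0))
```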

Circularity Check

0 steps flagged

No circularity: empirical hardware observation and measurements with no derivations or self-referential reductions.

Full rationale

The paper contains no equations, derivations, fitted parameters, or predictions that reduce to their inputs by construction. The central claims rest on direct empirical measurements of straggler reduction under specific workloads (GLM-5 MoE with 128 experts), presented as observations of NVLink Copy Engine behavior rather than any mathematical chain. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This is a standard engineering paper whose validity hinges on external reproducibility of the reported timings, not internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on unverified hardware behavior of the Copy Engine and concurrency assumptions; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The NVLink Copy Engine on the Hopper architecture moves data between intra-node GPUs without consuming SM cycles and runs in parallel with compute kernels
    Presented as a key observation enabling the nearly-free communication channel.

pith-pipeline@v0.9.0 · 5568 in / 1279 out tokens · 83923 ms · 2026-05-10T01:11:12.478246+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 12 canonical work pages · 5 internal anchors

  1. [1]

    Zhenkun Cai, Xiao Yan, Kaihao Ma, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, and Fan Yu. 2022. TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism. IEEE Transactions on Parallel and Distributed Systems 33, 8 (2022), 1967–1981. doi:10.1109/TPDS.2021.3132413

  2. [2]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.

  3. [3]

    Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. FasterMoE: Modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 120–134.

  4. [4]

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. 2023. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 5 (2023), 269–287.

  5. [5]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020).

  6. [6]

    Zongbiao Li, Xiezhao Li, Yinghao Cui, Yijun Chen, Zhixuan Gu, Yuxuan Liu, Wenbo Zhu, Fei Jia, Ke Liu, Qifeng Li, et al. 2024. Automatically Planning Optimal Parallel Strategy for Large Language Models. arXiv preprint arXiv:2501.00254 (2024).

  7. [7]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024).

  8. [8]

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism. arXiv preprint arXiv:2211.13878 (2022).

  9. [9]

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning. PMLR, 18332–18346.

  10. [10]

    Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, and Tao Zhong. 2025. Automatic Operator-level Parallelism Planning for Distributed Deep Learning: A Mixed-Integer Programming Approach. arXiv preprint arXiv:2503.09357 (2025).

  11. [11]

    Ziji Shi, Le Jiang, Ang Wang, Jie Zhang, Xianyan Jia, Yong Li, Chencan Wu, Jialin Li, and Wei Lin. 2023. TAP: Accelerating large-scale DNN training through tensor automatic parallelisation. arXiv preprint arXiv:2302.00247 (2023).

  12. [12]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).

  13. [13]

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534 (2025).

  14. [14]

    Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22).

  15. [15]

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471 (2025).

  16. [16]

    Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 961–975.

  17. [17]

    Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library. https://github.com/deepseek-ai/DeepEP

  18. [18]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.

  19. [19]

    Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, et al. 2025. Triton-distributed: Programming overlapping kernels on distributed AI systems with the Triton compiler. arXiv preprint arXiv:2504.19442 (2025).

  20. [20]

    Zhuoran Zhu, Chunyang Zhu, Hao Lin, Xu Fu, Yiming Zhou, Quanlu Zhang, Zhenhua Li, Feng Qian, Chao Yu, Boxun Li, et al. 2025. FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion. arXiv preprint arXiv:2512.22036 (2025).