Recognition: unknown
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
Pith reviewed 2026-05-10 01:11 UTC · model grok-4.3
The pith
The NVLink Copy Engine provides a nearly free channel for intra-node MoE load balancing by moving data without using any compute cycles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FEPLB uses a Two-Phase Dispatch in which tokens cross nodes via the standard EP backend and then dynamic-expert tokens and weights are redistributed inside the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler overlaps with static expert computation, producing 51-70% lower token straggler and 50-68% lower GEMM straggler with no measurable EP communication overhead.
What carries the argument
Two-Phase Dispatch that routes tokens across nodes with the EP backend then redistributes within the NVLink domain via the Copy Engine, operating orthogonally to EP and PP resources.
Load-bearing premise
The NVLink Copy Engine can move data between intra-node GPUs without consuming any streaming multiprocessor cycles and can run in parallel with compute kernels while the CPU scheduler executes concurrently.
What would settle it
Measure actual SM utilization on Hopper GPUs during Copy Engine transfers on the same MoE workload, or run the identical training on GPUs lacking a dedicated Copy Engine and check whether the reported straggler reductions disappear.
Figures
read the original abstract
Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide. Especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel that runs in parallel with compute kernels. FEPLB turns this idle hardware into a new parallel dimension for MoE load rebalancing. Its Two-Phase Dispatch first routes tokens across nodes via the standard EP backend, then redistributes dynamic-expert tokens and weights within the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler runs concurrently with static expert computation. Because FEPLB uses only Copy Engine and CPU that are orthogonal to those consumed by EP and PP, it coexists with existing parallel strategies without reconfiguration. On GLM-5's MoE layers (128 experts, no auxiliary loss, up to 16 H100 GPUs), FEPLB reduces the token straggler by 51-70% and the GEMM straggler by 50-68% with no measurable EP communication overhead. Its advantage grows with the EP degree: at EP=8, it achieves 2x lower token straggler than FasterMoE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FEPLB, a dynamic load-balancing method for MoE training that performs standard expert parallelism (EP) across nodes followed by intra-node redistribution of tokens and expert weights via the NVIDIA Hopper NVLink Copy Engine. A concurrent lightweight CPU scheduler operates alongside static expert computation. The central empirical claim is that this yields 51-70% reduction in token straggler and 50-68% reduction in GEMM straggler on GLM-5 MoE layers (128 experts, no auxiliary loss) across up to 16 H100 GPUs, with no measurable increase in EP communication overhead and a 2x improvement over FasterMoE at EP=8.
Significance. If the orthogonality of Copy Engine transfers to SM compute, EP communication, and CPU scheduling is validated, the technique offers a practical, low-overhead way to achieve fine-grained per-micro-batch balancing in large-scale MoE training without altering existing EP/PP configurations. The approach exploits underutilized hardware resources and could scale favorably with EP degree.
major comments (2)
- [Abstract] Abstract: The performance claims (51-70% token straggler reduction, 50-68% GEMM straggler reduction, 2x improvement over FasterMoE at EP=8) are stated without any reference to the measurement protocol for stragglers, the exact definition of baselines, number of experimental runs, error bars, or profiling methodology used to confirm zero EP overhead and full overlap. These details are load-bearing for the central empirical claim.
- [Two-Phase Dispatch] Two-Phase Dispatch description: The assertion that NVLink Copy Engine DMA transfers incur literally zero SM cycles, produce no NVLink/memory-controller contention with concurrent GEMM or EP kernels, and remain perfectly overlapped even while the CPU scheduler is active is the key assumption enabling the 'nearly free' result. No concrete profiling data (SM occupancy, bandwidth traces, or contention measurements) under the GLM-5 workload (128 experts, no aux loss) is referenced to substantiate full orthogonality.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from a brief sentence clarifying the precise scope of 'no measurable EP communication overhead' (e.g., whether it includes any secondary effects on NVLink bandwidth).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the clarity of our empirical claims and the need for explicit validation of hardware orthogonality. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The performance claims (51-70% token straggler reduction, 50-68% GEMM straggler reduction, 2x improvement over FasterMoE at EP=8) are stated without any reference to the measurement protocol for stragglers, the exact definition of baselines, number of experimental runs, error bars, or profiling methodology used to confirm zero EP overhead and full overlap. These details are load-bearing for the central empirical claim.
Authors: We agree that the abstract would benefit from brief methodological anchors to support the reported numbers. In the revision we will add a short parenthetical reference directing readers to the evaluation section, where stragglers are defined as (max - mean) expert latency per micro-batch, baselines are vanilla EP, results are averaged over five runs (std < 5 %), and overlap is confirmed via Nsight Systems traces showing no measurable EP time increase. Full experimental protocol remains in Section 5. revision: yes
-
Referee: [Two-Phase Dispatch] Two-Phase Dispatch description: The assertion that NVLink Copy Engine DMA transfers incur literally zero SM cycles, produce no NVLink/memory-controller contention with concurrent GEMM or EP kernels, and remain perfectly overlapped even while the CPU scheduler is active is the key assumption enabling the 'nearly free' result. No concrete profiling data (SM occupancy, bandwidth traces, or contention measurements) under the GLM-5 workload (128 experts, no aux loss) is referenced to substantiate full orthogonality.
Authors: The referee correctly identifies the central hardware assumption. The manuscript currently relies on NVIDIA Hopper documentation and prior microbenchmarks for Copy-Engine independence; end-to-end runs show unchanged EP communication time. To provide workload-specific evidence, the revised manuscript will add a dedicated profiling subsection containing SM occupancy traces, NVLink bandwidth utilization, and contention measurements collected under the exact GLM-5 (128-expert, no-aux-loss) configuration while Copy Engine, GEMM, and CPU scheduler execute concurrently. These data will directly substantiate the claimed orthogonality. revision: yes
Circularity Check
No circularity: empirical hardware observation and measurements with no derivations or self-referential reductions.
full rationale
The paper contains no equations, derivations, fitted parameters, or predictions that reduce to their inputs by construction. The central claims rest on direct empirical measurements of straggler reduction under specific workloads (GLM-5 MoE with 128 experts), presented as observations of NVLink Copy Engine behavior rather than any mathematical chain. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. This is a standard engineering paper whose validity hinges on external reproducibility of the reported timings, not internal definitional closure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption NVLink Copy Engine on Hopper architecture moves data between intra-node GPUs without consuming SM cycles and runs in parallel with compute kernels
Reference graph
Works this paper leans on
-
[1]
Zhenkun Cai, Xiao Yan, Kaihao Ma, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, and Fan Yu. 2022. TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism.IEEE Transactions on Parallel and Distributed Systems33, 8 (2022), 1967–1981. doi:10. 1109/TPDS.2021.3132413
-
[2]
William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch trans- formers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39. 6 FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
2022
-
[3]
Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 120–134
2022
-
[4]
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. 2023. Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems5 (2023), 269–287
2023
-
[5]
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668 (2020)
work page internal anchor Pith review arXiv 2020
- [6]
-
[7]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al . 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [8]
-
[9]
Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yux- iong He. 2022. Deepspeed-moe: Advancing mixture-of-experts infer- ence and training to power next-generation ai scale. InInternational conference on machine learning. PMLR, 18332–18346
2022
- [10]
- [11]
-
[12]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053(2019)
work page internal anchor Pith review arXiv 2019
-
[13]
Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. 2025. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534(2025)
work page internal anchor Pith review arXiv 2025
-
[14]
Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating {DNN} training through joint optimization of algebraic transforma- tions and parallelization. In16th USENIX Symposium on Operating Systems Design and Implem...
2022
-
[15]
Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengx- ing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471(2025)
work page internal anchor Pith review arXiv 2025
-
[16]
Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. {SmartMoE}: Efficiently training {Sparsely- Activated} models through combining offline and online paralleliza- tion. In2023 USENIX Annual Technical Conference (USENIX ATC 23). 961–975
2023
-
[17]
Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library. https://github.com/ deepseek-ai/DeepEP
2025
-
[18]
Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578
2022
- [19]
- [20]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.