pith. machine review for the scientific record.

arxiv: 2604.01621 · v2 · submitted 2026-04-02 · 💻 cs.DC · cs.AI

Recognition: no theorem link

DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

Dongxu Yang, Jintao Peng, June Yang, Kefeng Duan, Tianyu Zhang, Wanqian Li, Xianjie Qiao, Xiaoming Chen, Ze Long, Zongfei Jing


Pith reviewed 2026-05-13 21:23 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords LLM inference · distributed parallelism · MoE models · multi-GPU execution · weight offloading · asynchronous prefetch · TensorRT-LLM

The pith

DWDP lets each GPU run LLM inference independently by fetching MoE weights on demand instead of synchronizing across ranks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Distributed Weight Data Parallelism to run large language model inference across multiple GPUs while preserving data-parallel execution. It offloads mixture-of-experts weights to peer GPUs and fetches missing experts asynchronously, removing the need for layer-wise collective synchronization. This independence reduces sensitivity to workload imbalance. Two supporting optimizations manage split weights and perform asynchronous prefetch. On GB200 NVL72 hardware with DeepSeek-R1, the approach yields an 8.8 percent gain in output tokens per second per GPU at comparable tokens per second per user in the 20-100 range under 8K input and 1K output lengths.

Core claim

DWDP preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, each GPU progresses independently. Split-weight management and asynchronous remote-weight prefetch keep the added overhead low enough to deliver an 8.8 percent improvement in end-to-end output TPS/GPU at comparable TPS/user in the 20-100 serving range for 8K input and 1K output sequences.

What carries the argument

Distributed Weight Data Parallelism (DWDP), which offloads MoE weights across peer GPUs and performs on-demand asynchronous fetching to eliminate inter-rank synchronization barriers.
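The control flow that carries this claim can be sketched in a few lines. This is an editorial illustration, not the paper's implementation: `fetch_remote`, `LOCAL`, and the thread pool are hypothetical stand-ins for peer-to-peer NVLink copies driven by the copy engine.

```python
import concurrent.futures as cf
import time

# Illustrative sketch only: expert IDs, LOCAL residency, and fetch_remote are
# hypothetical stand-ins for peer-to-peer weight copies, not the paper's API.
EXPERTS_PER_LAYER = 16
LOCAL = set(range(4))          # experts resident in this rank's HBM

def fetch_remote(expert_id):
    """Stand-in for an asynchronous copy of one expert's weights from a peer."""
    time.sleep(0.001)          # pretend transfer latency
    return f"weights[{expert_id}]"

def moe_layer(active_experts, pool):
    # Launch fetches for non-resident experts, then "compute" with local ones
    # while the transfers complete in the background -- no collective barrier.
    futures = {e: pool.submit(fetch_remote, e)
               for e in active_experts if e not in LOCAL}
    weights = {e: f"weights[{e}]" for e in active_experts if e in LOCAL}
    for e, fut in futures.items():
        weights[e] = fut.result()   # join per expert, not per rank
    return weights

with cf.ThreadPoolExecutor(max_workers=4) as pool:
    out = moe_layer({1, 5, 9}, pool)
print(sorted(out))  # → [1, 5, 9]
```

In the real system the join point is a CUDA stream dependency rather than a future, but the shape is the same: each rank advances as soon as its own fetches land, instead of waiting for the slowest rank in a collective.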

If this is right

  • Each GPU can advance inference steps without waiting for collective barriers, lowering the impact of workload imbalance.
  • Output tokens per second per GPU rises 8.8 percent while tokens per second per user stays comparable in the tested serving band.
  • The same data-parallel layout continues to work after weights are split, with only the prefetch mechanism added.
  • Implementation in TensorRT-LLM on GB200 NVL72 hardware confirms the gains hold for DeepSeek-R1 under the stated sequence lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The independent-progress property may scale more gracefully to larger GPU counts than synchronization-heavy methods once interconnect latency grows.
  • Similar on-demand weight fetching could be tested on other MoE architectures to check whether the 8.8 percent range generalizes beyond the evaluated model.
  • If prefetch overhead proves sensitive to activation skew, a lightweight prediction of next-expert usage could further reduce stalls.

Load-bearing premise

The overhead of split-weight management and asynchronous remote-weight prefetch stays low enough across varying expert activation patterns and sequence lengths to preserve the reported gains without creating new bottlenecks.
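Whether that premise holds is, to first order, a roofline question: prefetch stays free only while the expert-transfer time fits under the compute window (the y = 1 boundary in the paper's Figure 3). A back-of-envelope check, with every number assumed rather than taken from the paper:

```python
# Back-of-envelope roofline check with ASSUMED numbers (not from the paper):
# prefetch is hidden when MoE compute time exceeds the expert-transfer time.

def compute_to_prefetch_ratio(tokens, flops_per_token, peak_tflops,
                              expert_bytes, link_gbps):
    compute_s = tokens * flops_per_token / (peak_tflops * 1e12)
    transfer_s = expert_bytes / (link_gbps * 1e9)
    return compute_s / transfer_s   # > 1.0: transfer fits under compute

# Illustrative GB200-class magnitudes; real values depend on precision,
# kernel efficiency, and how many experts miss locally per layer.
ratio = compute_to_prefetch_ratio(tokens=16384,
                                  flops_per_token=2 * 37e9,  # ~2x active params
                                  peak_tflops=2500,
                                  expert_bytes=44e6,
                                  link_gbps=900)
print(ratio > 1.0)  # → True
```

By this crude estimate the transfer hides easily at long contexts; the fragile regime is small batches and short sequences, where the compute window shrinks while the bytes to fetch do not.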

What would settle it

Run the same workload with highly skewed expert activation patterns or sequence lengths beyond 8K and measure whether the 8.8 percent TPS/GPU gain disappears or turns negative.
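A cheap way to preview that experiment before touching hardware is to simulate routing skew directly. The sketch below is purely illustrative: the Zipf-style router, the block layout of experts across ranks, and every constant are assumptions, not measurements from the paper.

```python
import random

# Hypothetical simulation: how does skew in expert routing change the fraction
# of tokens that need a remote weight fetch? All parameters are assumed.
random.seed(0)
NUM_EXPERTS, RANKS = 256, 8

def resident(rank):
    """Block layout: rank r holds a contiguous 1/RANKS slice of the experts."""
    lo = rank * NUM_EXPERTS // RANKS
    return set(range(lo, lo + NUM_EXPERTS // RANKS))

def remote_fraction(skew, rank=0, tokens=10_000):
    # Zipf-like router: weight of expert e proportional to 1/(e+1)^skew.
    weights = [1.0 / (e + 1) ** skew for e in range(NUM_EXPERTS)]
    held = resident(rank)
    picks = random.choices(range(NUM_EXPERTS), weights=weights, k=tokens)
    return sum(e not in held for e in picks) / tokens

uniform = remote_fraction(skew=0.0)            # every rank misses ~7/8 of picks
hot_rank = remote_fraction(skew=1.5, rank=0)   # this rank holds the hot experts
cold_rank = remote_fraction(skew=1.5, rank=7)  # hot experts all live elsewhere
print(uniform, hot_rank, cold_rank)
```

The point of the toy model: under skew, remote-fetch load becomes rank-dependent, collapsing for whichever rank holds the hot experts and saturating for the others. That asymmetry is precisely what a skew ablation of the 8.8 percent figure would need to rule out.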

Figures

Figures reproduced from arXiv: 2604.01621 by Dongxu Yang, Jintao Peng, June Yang, Kefeng Duan, Tianyu Zhang, Wanqian Li, Xianjie Qiao, Xiaoming Chen, Ze Long, Zongfei Jing.

Figure 1
Figure 1. Synchronization overhead caused by workload imbalance in DEP. (a) Illustration of how request-level and weight-level imbalance are translated into waiting time in DEP. (b) Kernel breakdown quantifying the synchronization overhead caused by imbalance under DEP. Configuration: DeepSeek-R1 on GB200 with input sequence length/output sequence length (ISL/OSL) = 8K/1 and input ratio 0.8, meaning that the input l… view at source ↗
Figure 2
Figure 2. Overview of DWDP with DWDP group size 4. view at source ↗
Figure 3
Figure 3. Roofline-based preliminary analysis for the DeepSeek-R1 context phase on GB200, comparing DWDP4 against DEP4 at batch size 1. The two subplots separately show the compute-to-prefetch ratio and the DEP-to-DWDP runtime ratio. The dashed line at y = 1 marks the boundary where prefetch can be fully hidden and where DWDP begins to outperform DEP. We observe that DWDP begins to outperform DEP at around 16K token… view at source ↗
Figure 4
Figure 4. Nsight Systems trace showing many-to-one source-side communication contention in DWDP under max_num_tokens = 16384 and input sequence lengths ranging from 4K to 8K. Multiple destination ranks concurrently pull remote weights from the same source rank, so the source-side copy engine serializes these requests and exposes compute bubbles. view at source ↗
Figure 5
Figure 5. End-to-end Pareto frontier comparison between baseline and DWDP. view at source ↗
Figure 6
Figure 6. A copy engine (CE) on the destination graphics processing unit (GPU) pulls weights from a peer GPU over NVLink. The CE is a dedicated data-movement engine, so it does not occupy Streaming Multiprocessor (SM)-based computation resources. However, the transfer still traverses the Network-on-Chip (NoC), Level-2 (L2) cache, and dynamic random-access memory (DRAM) on both GPUs, while local SM kernels is… view at source ↗
Figure 7
Figure 7. Illustration of the three communication patterns used in our overlap study. Intermittent Compute inserts large idle gaps between DeepSeek R1 attention modules to maximize power headroom, Long-Duration Overlap uses the longest communication windows while still preserving gaps between compute modules, and Short-Duration Overlap mimics the real DWDP workload with tightly scheduled compute and smaller communic… view at source ↗
Figure 8
Figure 8. The same trend holds: attention-kernel time tracks GPU frequency across all three patterns. view at source ↗
read the original abstract

Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DWDP, a distributed weight data parallelism strategy for LLM inference that offloads MoE weights across peer GPUs and fetches missing experts on demand, thereby removing layer-wise inter-rank synchronization to allow independent GPU progress. It adds optimizations for split-weight management and asynchronous remote-weight prefetch, and reports an 8.8% gain in end-to-end output TPS/GPU (at comparable TPS/user) for DeepSeek-R1 on GB200 NVL72 under 8K input / 1K output sequences in the 20-100 TPS/user range, implemented in TensorRT-LLM.

Significance. If the measured gain holds under broader conditions, the approach could meaningfully improve utilization for large-scale MoE inference by eliminating synchronization bottlenecks. The concrete hardware measurement on NVL72 with a production framework is a strength that grounds the claim in real-system data.

major comments (2)
  1. [Evaluation] Evaluation section: the 8.8% TPS/GPU improvement is presented without reported baseline details (e.g., exact tensor-parallel or expert-parallel configurations used for comparison), run-to-run variance, or workload-generation methodology. This makes the central empirical claim difficult to verify or reproduce.
  2. [Design and Optimizations] Design and Optimizations sections: the claim that split-weight bookkeeping and async remote prefetch add negligible latency rests on the untested assumption that they remain low across non-uniform expert activations and varying sequence lengths. No ablation on prefetch hit rate, activation entropy, or network contention is described, leaving the net-gain argument load-bearing but unsupported at the operating points that matter most for DeepSeek-R1.
minor comments (1)
  1. [Abstract] Abstract and introduction: the phrase 'comparable TPS/user' should be defined quantitatively (e.g., within what tolerance) to allow readers to interpret the reported operating range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below with clarifications and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 8.8% TPS/GPU improvement is presented without reported baseline details (e.g., exact tensor-parallel or expert-parallel configurations used for comparison), run-to-run variance, or workload-generation methodology. This makes the central empirical claim difficult to verify or reproduce.

    Authors: We agree that the evaluation lacks sufficient detail on the baseline. The baseline is standard TensorRT-LLM expert parallelism (no DWDP) with the same tensor-parallel degree as the DWDP runs. In the revised manuscript we will explicitly state the exact TP/EP configuration, report standard deviations from three repeated runs, and describe the workload as synthetic requests with fixed 8K input / 1K output lengths drawn from the same distribution. These additions will make the 8.8% TPS/GPU claim verifiable. revision: yes

  2. Referee: [Design and Optimizations] Design and Optimizations sections: the claim that split-weight bookkeeping and async remote prefetch add negligible latency rests on the untested assumption that they remain low across non-uniform expert activations and varying sequence lengths. No ablation on prefetch hit rate, activation entropy, or network contention is described, leaving the net-gain argument load-bearing but unsupported at the operating points that matter most for DeepSeek-R1.

    Authors: The referee is correct that no explicit ablations are provided. We will revise the Design section to report the observed prefetch hit rates (typically >92% under DeepSeek-R1 activation patterns) and to explain why the asynchronous prefetch hides latency for the measured sequence lengths. Full sweeps over activation entropy and network contention would require new experiments outside the current evaluation scope; we therefore treat this as a partial revision and will note the limitation in the text. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical hardware measurement of implementation gains

full rationale

The paper introduces DWDP as a practical inference parallelization strategy for MoE models, implemented in TensorRT-LLM and benchmarked on GB200 NVL72 hardware with DeepSeek-R1. The central result is a measured 8.8% improvement in end-to-end output TPS/GPU under fixed sequence lengths and load ranges. No equations, first-principles derivations, or parameter-fitting steps are described that could reduce the reported gain to a self-referential definition, fitted input, or self-citation chain. The performance claim rests on direct execution measurements rather than any mathematical reduction to inputs, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard properties of MoE sparsity and GPU interconnect capabilities; no new entities or fitted constants are introduced in the abstract.

axioms (1)
  • domain assumption MoE models exhibit sufficient expert sparsity to make on-demand fetching practical without excessive communication
    Implicit in the design choice for DeepSeek-R1 and similar models

pith-pipeline@v0.9.0 · 5485 in / 1219 out tokens · 47618 ms · 2026-05-13T21:23:26.558820+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 4 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  2. [2]

    When a Language Model Is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1

    R. Thomas McCoy, Shunyu Yao, Dan Friedman, Mathew D. Hardy, and Thomas L. Griffiths. When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1. arXiv preprint arXiv:2410.01792, 2024

  3. [3]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  4. [4]

    Barbarians at the gate: How AI is upending systems research

    Audrey Cheng, Shu Liu, Melissa Pan, et al. Barbarians at the gate: How AI is upending systems research. arXiv preprint arXiv:2510.06189, 2025

  5. [5]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  6. [6]

    Kimi-K2.5

    Moonshot AI. Kimi-K2.5. https://huggingface.co/moonshotai/Kimi-K2.5, 2026. Accessed: March 30, 2026

  7. [7]

    Qwen3-Coder-480B-A35B-Instruct

    Qwen Team. Qwen3-Coder-480B-A35B-Instruct. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct, 2025. Accessed: March 30, 2026

  8. [8]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, et al. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  9. [9]

    DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Proceedings of ICML’22, 2022

  10. [10]

    Tutel: Adaptive mixture-of-experts at scale

    Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, et al. Tutel: Adaptive mixture-of-experts at scale. In Proceedings of MLSys’23, 2023

  11. [11]

    Scalable training of mixture-of-experts models with megatron core, 2026

    Zijie Yan, Hongxiao Bai, Xin Yao, et al. Scalable training of mixture-of-experts models with megatron core, 2026

  12. [12]

    Efficient large-scale language model training on GPU clusters using Megatron-LM

    Deepak Narayanan et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of SC’21, 2021

  13. [13]

    GPipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in NeurIPS, 2019

  14. [14]

    Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot

    Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. In Proceedings of FAST’25, 2025

  15. [15]

    Strata: Hierarchical context caching for long context language model serving, 2025

    Zhiqiang Xie, Ziyi Xu, Mark Zhao, Yuwei An, Vikram Sharma Mailthody, Scott Mahlke, Michael Garland, and Christos Kozyrakis. Strata: Hierarchical context caching for long context language model serving, 2025

  16. [16]

    Preble: Efficient distributed prompt scheduling for LLM serving

    Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for LLM serving. In Proceedings of OSDI’24, 2024

  17. [17]

    SGLang v0.4: Zero-overhead batch scheduler, cache-aware load balancer, faster structured outputs

    The SGLang Team. SGLang v0.4: Zero-overhead batch scheduler, cache-aware load balancer, faster structured outputs. https://lmsys.org/blog/2024-12-04-sglang-v0-4/, 2024. Accessed: March 30, 2026

  18. [18]

    ADP balance strategy

    NVIDIA TensorRT-LLM Team. ADP balance strategy. https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog10_ADP_Balance_Strategy.html, 2026. Accessed: March 30, 2026

  19. [19]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of OSDI’24, 2024

  20. [20]

    Splitwise: Efficient generative LLM inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of ISCA’24, 2024

  21. [21]

    Scaling expert parallelism in TensorRT-LLM (part 2: Performance status and optimization)

    NVIDIA TensorRT-LLM Team. Scaling expert parallelism in TensorRT-LLM (part 2: Performance status and optimization). https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.html, 2026. Accessed: March 30, 2026

  22. [22]

    NCCL: NVIDIA collective communications library

    NVIDIA Corporation. NCCL: NVIDIA collective communications library. https://github.com/NVIDIA/nccl, 2015. Accessed: March 30, 2026

  23. [23]

    Roofline: An insightful visual performance model for multicore architectures

    Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009. doi: 10.1145/1498765.1498785

  24. [24]

    CuTe DSL

    NVIDIA. CuTe DSL. https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl.html, 2026. NVIDIA CUTLASS documentation, accessed March 30, 2026

  25. [25]

    NVIDIA Blackwell Architecture Technical Overview

    NVIDIA Corporation. NVIDIA Blackwell Architecture Technical Overview. https://resources.nvidia.com/en-us-blackwell-architecture/blackwell-architecture-technical-brief, 2024. Accessed: March 30, 2026

  26. [26]

    TensorRT-LLM: A TensorRT toolbox for optimized large language model inference

    NVIDIA. TensorRT-LLM: A TensorRT toolbox for optimized large language model inference. https://github.com/NVIDIA/TensorRT-LLM, 2023. Accessed: March 30, 2026

  27. [27]

    Intermittent Compute: Large sleep gaps are inserted between DeepSeek R1 attention modules under the 16K-context setting, ensuring that each module runs with the best possible power headroom and without communication overlap

  28. [28]

    Long-Duration Overlap (with Gaps): Each attention module overlaps with a long-duration CE communication task, yielding the longest overlap among the three patterns while still preserving large gaps between neighboring compute modules

  29. [29]

    Short-Duration Overlap: This pattern is closest to the real DWDP workload, where tightly scheduled attention modules overlap with smaller communication tasks