TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

Nipun Kwatra; Raja Gond; Ramachandran Ramjee

arxiv: 2505.11329 · v5 · submitted 2025-05-16 · 💻 cs.DC · cs.LG

Pith reviewed 2026-05-22 14:18 UTC · model grok-4.3

keywords distributed LLM inference · tensor parallelism · compute-communication overlap · AllReduce · RMSNorm · GPU kernels · LLM serving

The pith

TokenWeave enables efficient compute-communication overlap for tensor-parallel LLM inference at token lengths as small as 1024.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how fusing AllReduce with RMSNorm into one kernel lets communication overlap with computation in tensor-parallel LLM serving even at low token counts. Existing systems turn overlap off for small workloads because splitting the work usually adds more overhead than it saves, and the communication itself occupies streaming multiprocessors (SMs) that computation could otherwise use. The new kernel needs only 2-8 SMs of a modern GPU because it uses the NVSHARP/Multimem hardware feature to perform communication and normalization jointly. If this holds, serving systems could deliver faster responses without forcing larger batches or removing parallelism.

Core claim

TokenWeave is the first system to enable efficient compute-communication overlap for tensor-parallel model inference for token lengths as small as 1024. It identifies RMSNorm as a key operation and optimizes it together with communication through a novel fused AllReduce-RMSNorm kernel. This kernel leverages the NVSHARP/Multimem feature on Hopper and Blackwell GPUs to perform both tasks jointly using only 2-8 streaming multiprocessors on an 8xH100 DGX system, delivering up to 1.28x latency speedup and up to 1.19x higher throughput across models and workloads.

What carries the argument

The fused AllReduce-RMSNorm kernel that uses NVSHARP/Multimem to jointly execute communication and RMSNorm while allocating only 2-8 streaming multiprocessors.

If this is right

Tensor-parallel serving can keep low token counts per iteration while cutting communication overheads that reach 20 percent on NVLink.
Latency drops and throughput rises across multiple models without changing the overall serving setup.
In some cases the system outperforms an equivalent model that has all communication removed entirely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The fusion idea could be tested on other normalization or activation steps to reduce overhead in additional layers.
Systems running on older GPUs might need alternative low-SM methods to reach similar overlap for small token lengths.

Load-bearing premise

The performance gains require NVSHARP or Multimem hardware support on Hopper or Blackwell GPUs and assume that dedicating only a few SMs to the fused kernel leaves enough resources for the rest of the model.

What would settle it

Running the same workloads on GPUs without NVSHARP/Multimem support and checking whether the reported latency and throughput improvements over baseline still appear.

Figures

Figures reproduced from arXiv: 2505.11329 by Nipun Kwatra, Raja Gond, Ramachandran Ramjee.

**Figure 1.** AllReduce communication and RMSNorm overheads for three models versus sequence length on an 8×H100 DGX system (median across 12 runs; GPU clocks are locked to the TDP frequency (Prescott, 2022)). Despite NVLink and NVSHARP, communication overheads range from 9–23%. RMSNorm overheads are also non-trivial, ranging from 4–9%.

**Figure 2.** Inference latency of Llama-3.3-70B on an 8×H100 DGX system for various sequence lengths. vLLM-Multimem corresponds to vLLM with an optimized AllReduce implementation using Multimem (NVIDIA, 2023) and NVSHARP (NVIDIA, 2024) support. vLLM-nocomm is a counterfactual baseline corresponding to only the computation time without any communication. The dotted lines show performance normalized to the vLLM-Multimem…

**Figure 3.** Selective enabling of splitting/overlap in TokenWeave. At each iteration, the num tokens in the batch is checked against a token threshold. Full TokenWeave with splitting and overlap is enabled only for higher values of num tokens. For smaller num tokens, where splitting can result in higher overheads, we only enable the fused AllReduce–RMSNorm kernel. The method applies uniformly to prefill-only, mixed, a…
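The per-iteration check described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the names `IterationPlan`, `plan_iteration`, and `TOKEN_THRESHOLD` are hypothetical, the threshold value is a placeholder (the page only says overlap pays off at token lengths as small as 1024), and treating the boundary as inclusive is an assumption.

```python
from dataclasses import dataclass

# Hypothetical placeholder: the actual per-model threshold is not given on this page.
TOKEN_THRESHOLD = 1024


@dataclass(frozen=True)
class IterationPlan:
    use_fused_kernel: bool   # fused AllReduce-RMSNorm kernel
    split_and_overlap: bool  # two-way token split with compute-communication overlap


def plan_iteration(num_tokens: int, threshold: int = TOKEN_THRESHOLD) -> IterationPlan:
    """Choose the execution mode for one iteration from the batch's token count."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # The fused kernel is used in both modes; splitting and overlap are added
    # only when the batch is large enough (inclusive boundary is an assumption).
    return IterationPlan(
        use_fused_kernel=True,
        split_and_overlap=num_tokens >= threshold,
    )
```

For example, `plan_iteration(256)` keeps only the fused kernel, while `plan_iteration(4096)` also enables splitting and overlap.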

**Figure 4.** Splitting AllReduce (AR) into ReduceScatter (RS) and AllGather (AG) can result in non-trivial overheads. Shown are the absolute times and the relative performance (line plots) of these operations on an 8×H100 DGX system for varying sequence lengths. All runs are with a hidden size of 8192 using bf16.
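The decomposition named in the Figure 4 caption is functionally exact: a ReduceScatter followed by an AllGather yields the same tensor on every rank as a single AllReduce. The NumPy simulation below is a sketch of that equivalence only; the function names are illustrative, ranks are modelled as list entries in one process, and nothing here captures the timing overheads the figure measures.

```python
import numpy as np


def all_reduce(shards):
    """Every rank ends up with the elementwise sum of all ranks' tensors."""
    total = np.sum(shards, axis=0)
    return [total.copy() for _ in shards]


def reduce_scatter(shards):
    """Rank r ends up with only the r-th row chunk of the summed tensor."""
    total = np.sum(shards, axis=0)
    return np.array_split(total, len(shards), axis=0)


def all_gather(chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = np.concatenate(chunks, axis=0)
    return [full.copy() for _ in chunks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 8 simulated ranks, each holding a (tokens, hidden) partial result.
    shards = [rng.standard_normal((64, 32)).astype(np.float32) for _ in range(8)]
    direct = all_reduce(shards)
    two_step = all_gather(reduce_scatter(shards))
    print(all(np.allclose(a, b) for a, b in zip(direct, two_step)))
```

The two paths agree numerically, so any gap between them in practice comes from launch, synchronization, and bandwidth effects rather than from the arithmetic.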

**Figure 5.** Multimem-based AllReduce implementations require very few SMs. Shown is the performance of the Multimem AllReduce kernel with different numbers of SMs for varying sequence lengths (hidden size 8192, bf16). In most cases, 4–8 SMs are enough to achieve near-optimal performance.

**Figure 6.** Large collective operations are more efficient. Shown is the bandwidth for ReduceScatter (RS) on an 8×H100 DGX system for varying sequence lengths (hidden size 8192, bf16). Larger tensors result in much better bandwidth, demonstrating that splitting input into smaller parts results in overheads.

**Figure 7.** Overview of TokenWeave: (a) Vanilla Tensor Parallelism: all compute and communication operations are performed sequentially. (b) TokenWeave: the input batch is partitioned into two splits. AllReduce is fused with RMSNorm, and computation of one split overlaps with communication of the other split. Separate compute and communication streams weave to orchestrate the overlap.
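A toy timeline model can make the contrast between panels (a) and (b) of Figure 7 concrete. The sketch below is an idealised assumption-laden model, not a description of the real scheduler: it uses one compute stream and one communication stream, assumes splitting does not inflate per-split costs (Figures 4 and 6 indicate it can), and assumes communication does not slow concurrent compute. The function names and the example times are hypothetical.

```python
def sequential_time(num_layers, compute, comm):
    """Vanilla tensor parallelism: every compute and communication step runs back to back."""
    return num_layers * (sum(compute) + sum(comm))


def overlapped_time(num_layers, compute, comm):
    """Two splits on two streams: a split's communication runs while the other split computes.

    compute and comm are (split0, split1) per-layer times in the same unit.
    """
    compute_free = 0.0   # time at which the compute stream becomes idle
    comm_free = 0.0      # time at which the communication stream becomes idle
    ready = [0.0, 0.0]   # time at which each split's input for its next layer is ready
    for _ in range(num_layers):
        for s in (0, 1):
            # Compute for split s waits for the stream and for its previous communication.
            start = max(compute_free, ready[s])
            compute_free = start + compute[s]
            # Communication for split s starts once its compute is done and the stream is idle.
            comm_start = max(comm_free, compute_free)
            comm_free = comm_start + comm[s]
            ready[s] = comm_free
    return max(compute_free, comm_free)


if __name__ == "__main__":
    layers, compute, comm = 2, (1.0, 1.0), (0.5, 0.5)
    print(sequential_time(layers, compute, comm))  # 6.0
    print(overlapped_time(layers, compute, comm))  # 4.5
```

In this model only the last split's communication is exposed; every other communication step hides behind the other split's compute, which is the effect the "weaving" of the two streams is meant to achieve.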

**Figure 8.** Benefits of smart-splitting: Execution of a kernel with ten CTAs on a hypothetical GPU with four SMs. Assuming each CTA exclusively occupies one SM, the kernel executes in multiple waves as SM resources become available.
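The wave arithmetic behind the Figure 8 scenario can be worked through in a few lines. This is a simplified sketch under the caption's own assumption (one CTA per SM at a time); the helper names are hypothetical, the split is expressed directly in CTAs rather than tokens, and the rounding rule (first split is a whole number of waves, rounded down from the midpoint) is one plausible reading of "smart-splitting", not necessarily the paper's exact policy.

```python
import math


def waves(num_ctas: int, num_sms: int) -> int:
    """Number of waves needed when each SM runs one CTA at a time."""
    return math.ceil(num_ctas / num_sms)


def equal_split(num_ctas: int):
    """Naive two-way split down the middle."""
    first = num_ctas // 2
    return first, num_ctas - first


def wave_aligned_split(num_ctas: int, num_sms: int):
    """Two-way split whose first part fills a whole number of waves (assumed policy)."""
    if num_ctas <= num_sms:
        return num_ctas, 0  # fits in one wave; splitting cannot help
    full_waves = max(1, (num_ctas // 2) // num_sms)
    first = full_waves * num_sms
    return first, num_ctas - first


if __name__ == "__main__":
    ctas, sms = 10, 4  # the Figure 8 scenario
    print(waves(ctas, sms))                                       # 3 waves unsplit
    print(sum(waves(c, sms) for c in equal_split(ctas)))          # 5 + 5 -> 4 waves
    print(sum(waves(c, sms) for c in wave_aligned_split(ctas, sms)))  # 4 + 6 -> 3 waves
```

An equal split turns three waves into four because each half leaves a partially filled wave, whereas the wave-aligned split keeps the total at three.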

**Figure 10.** Our fused AllReduce–RMSNorm kernel performs optimally with very few SMs. We show the latency of the kernel under varying numbers of SMs for different sequence lengths (hidden size 8192 on an 8×H100 DGX, bf16). Using 8 SMs is close to optimal in most cases.

**Figure 11.** TokenWeave throughput gains for end-to-end workload traces. We show the measured throughput for ShareGPT, arXiv, as well as fixed (input, output)-length traces for different models on 8×H100 DGX.

**Figure 12.** TokenWeave throughput gains for end-to-end traces under chunk size variation. We show the measured throughput for ShareGPT, arXiv, as well as fixed (input, output)-length traces for Llama-3.3-70B on 8×H100 DGX. Chunk size varied from 1024–8192.

**Figure 13.** TokenWeave latency gains. Execution times of prefill requests with varying sequence lengths for different models on 8×H100. TokenWeave is close to or better than the theoretical vLLM-nocomm baseline with zero communication overhead, showing that TokenWeave not only recovers all communication overhead, but provides additional gains due to RMSNorm optimization.

**Figure 14.** Single-layer latency for Llama-3.3-70B on an 8×H100 DGX system. Numbers at the top represent normalized performance compared to vLLM-Multimem. While TileLink ends up with an overhead at small sequence lengths, TokenWeave consistently provides high gains over the entire sequence length range.

**Figure 15.** NanoFlow throughput for end-to-end workload traces under fixed (input, output)-length traces for Llama-3.3-70B on an 8×H100 DGX. nanoflow-full corresponds to full NanoFlow, while nanoflow-frameworkonly disables NanoFlow (both nanobatching and overlap) but uses their custom serving framework.

**Figure 16.** We compare TokenWeave-fuseonly and full TokenWeave against the vLLM-Multimem baseline. Execution times are shown for prefill requests with varying sequence lengths for different models on 8×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memory accesses, while TokenWeave provides additional gains from the compute-communication overlap.

**Figure 17.** Ablation results. TokenWeave-fuseonly uses only the fused kernel. TokenWeave-equalsplit enables token splitting and overlap but does not apply smart-splitting. Experiments are conducted on an 8×H100 DGX system.

**Figure 18.** Figure 18 presents the implementation of our fused AllReduce–RMSNorm CUDA kernel. Each CTA processes a subset of tokens, performs inter-GPU reduction using the multimem ld reduce add primitive, computes the local variance, and applies normalization before storing results via multimem st. By eliminating intermediate memory accesses and offloading reduction to NVSwitch, this fused implementation achieves lower HBM co…
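The steps in the Figure 18 description can be mirrored in a CPU-side NumPy reference to show what the fused kernel is expected to compute. This is only a semantic sketch under stated assumptions: ranks are simulated as a leading array axis, each loop iteration stands in for one CTA handling a slice of tokens, the plain sum stands in for the hardware reduction, and RMSNorm is taken in its standard form (divide by the root mean square, then scale by a learned weight). The function names, `eps` value, and `num_ctas` default are illustrative, and any residual addition the real kernel may perform is omitted.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    """Standard RMSNorm over the hidden dimension, computed in float32."""
    x32 = x.astype(np.float32)
    mean_sq = np.mean(x32 * x32, axis=-1, keepdims=True)
    return x32 / np.sqrt(mean_sq + eps) * weight


def unfused_reference(partials, weight, eps=1e-6):
    """AllReduce first (materialising the full sum), then a separate RMSNorm pass."""
    reduced = np.sum(np.asarray(partials), axis=0)
    return rms_norm(reduced, weight, eps)


def fused_reference(partials, weight, num_ctas=4, eps=1e-6):
    """Per token slice: reduce across ranks and normalise in one pass."""
    partials = np.asarray(partials)           # shape (ranks, tokens, hidden)
    num_tokens = partials.shape[1]
    out = np.empty(partials.shape[1:], dtype=np.float32)
    # Each slice of token indices models the work assigned to one CTA.
    for token_ids in np.array_split(np.arange(num_tokens), num_ctas):
        reduced = partials[:, token_ids, :].sum(axis=0)   # cross-rank reduction
        out[token_ids] = rms_norm(reduced, weight, eps)   # normalise before storing
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    partials = rng.standard_normal((8, 16, 64)).astype(np.float32)
    weight = rng.standard_normal(64).astype(np.float32)
    print(np.allclose(fused_reference(partials, weight),
                      unfused_reference(partials, weight), atol=1e-5))
```

The fused and unfused paths agree numerically; the difference the page attributes to fusion is in memory traffic and SM usage, which a NumPy model does not capture.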

**Figure 19.** Shown are the execution times for prefill requests with varying sequence lengths for Qwen3-235B-A22B on an 8×H100 DGX system.

**Figure 20.** TokenWeave throughput gains for end-to-end workload traces. Shown are throughput measurements across fixed (input, output)-length traces, as well as ShareGPT and arXiv traces, for two models on 4×H100.

**Figure 21.** TokenWeave latency gains. Shown are the execution times for prefill requests with varying sequence lengths for different models on 4×H100.

**Figure 22.** TokenWeave latency gains. Shown are the execution times for prefill requests with varying batch sizes and sequence length of 512 for different models on (a) 8×H100 and (b) 4×H100. In almost all cases, TokenWeave is close to or better than the theoretical vLLM-nocomm baseline with zero communication overhead, showing that TokenWeave not only recovers all communication overhead but also provides additional …

**Figure 23.** TokenWeave Fused AllReduce–RMSNorm kernel ablation. We compare TokenWeave-fuseonly and full TokenWeave against the vLLM-Multimem baseline. Execution times are shown for prefill requests with varying sequence lengths for different models on 4×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memory accesses, while TokenWeave provides addit…

**Figure 24.** TokenWeave Fused AllReduce–RMSNorm kernel ablation. Shown are the execution times of prefill requests with varying batch sizes, with the sequence length fixed at 512, for different models on (a) 8×H100 and (b) 4×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memory accesses, while TokenWeave provides additional gains through compute-co…

**Figure 26.** TokenWeave Decode Latency Gains on an 8×B200 DGX system for Llama-3.3-70B (context lengths 2K and 4K) and Qwen3-235B-A22B (context length 2K).

**Figure 27.** TokenWeave Prefill latency gains on an 8×B200 DGX system for Llama-3.3-70B (varying sequence lengths). TokenWeave-fuseonly achieves 1.05×–1.07× speedup, while full TokenWeave achieves up to 1.22× over vLLM-Multimem.

Original abstract

Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these subtasks. However, none of these techniques are turned on by default during tensor-parallel serving in systems like vLLM, SGLang and TensorRT-LLM. This is because the number of tokens processed per iteration is typically kept small to support low-latency serving, and decomposing such smaller workloads to enable communication overlap results in worse performance. Further, the communication itself uses many streaming multiprocessors (SMs) that would otherwise be available for computation, increasing overhead. We present TokenWeave, the first system to enable efficient compute-communication overlap for tensor-parallel model inference for token lengths as small as 1024. TokenWeave identifies RMSNorm, a previously overlooked operation, as crucial and optimizes it along with communication by implementing a novel fused AllReduce–RMSNorm kernel. Further, this kernel leverages the NVSHARP/Multimem feature available on modern GPUs (e.g., Hopper, Blackwell) to jointly perform communication and RMSNorm efficiently using only 2–8 streaming multiprocessors (SMs) on an 8×H100 DGX system. Our evaluations demonstrate up to 1.28× speedup in latency (baseline ÷ ours) and up to 1.19× higher throughput (ours ÷ baseline) across multiple models and workloads. In several settings, TokenWeave delivers better performance than an equivalent model with all communication removed. The source code is available at https://github.com/microsoft/tokenweave.
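The abstract defines its two headline ratios in opposite directions (baseline ÷ ours for latency, ours ÷ baseline for throughput), so both read as "higher is better". The short sketch below spells out those definitions and converts a latency speedup into the equivalent fractional time saved. The helper names are hypothetical, and the millisecond and tokens-per-second inputs are made-up values chosen only to reproduce the quoted ratios, not measurements from the paper.

```python
def latency_speedup(baseline_ms: float, ours_ms: float) -> float:
    """Latency speedup as defined in the abstract: baseline divided by ours."""
    return baseline_ms / ours_ms


def throughput_gain(baseline_tps: float, ours_tps: float) -> float:
    """Throughput gain as defined in the abstract: ours divided by baseline."""
    return ours_tps / baseline_tps


def latency_reduction(speedup: float) -> float:
    """Fraction of the baseline latency that a given speedup removes."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Made-up inputs that reproduce the quoted upper bounds.
    print(latency_speedup(128.0, 100.0))    # 1.28
    print(throughput_gain(1000.0, 1190.0))  # 1.19
    print(latency_reduction(1.28))          # 0.21875, i.e. about 22% less time
```

Read this way, the best-case 1.28× latency speedup corresponds to roughly a 22% reduction in execution time.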

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TokenWeave fuses AllReduce with RMSNorm using few SMs on Hopper GPUs to enable overlap at small token counts where prior methods fail, but the gains hinge on unshown SM utilization details.

The key takeaway here is that TokenWeave fuses AllReduce with RMSNorm to achieve compute-communication overlap in tensor-parallel inference even when processing as few as 1024 tokens. This is new because earlier techniques relied on decomposing workloads in ways that hurt performance at these small scales, and RMSNorm had not been used this way before.

The paper does a good job of explaining the problem with current systems like vLLM and showing how their kernel leverages NVSHARP on Hopper GPUs to keep the fused operation lightweight. Reporting speedups across models and noting cases where it outperforms no-communication setups adds credibility. Releasing the code is helpful for the community.

On the soft spots, the performance claims rest on the idea that allocating only 2-8 SMs to this kernel does not create bottlenecks elsewhere. The stress test note points out that without data on kernel occupancy or how the remaining SMs handle the matmuls, we cannot be sure the net gain holds up. The abstract lacks error bars and detailed workload descriptions, which makes it difficult to assess how robust the 1.28x and 1.19x numbers are. These results also depend on specific hardware capabilities, so generalization to other setups is unclear.

This paper is aimed at practitioners and researchers focused on optimizing distributed inference for large models. Anyone dealing with latency-sensitive serving on NVLink-connected GPUs could find the technique and code valuable. It shows honest engagement with the literature on overlap methods and has a clear, reproducible contribution through the implementation. I would send it for peer review to get feedback on the evaluation methodology and hardware assumptions.

Referee Report

2 major / 2 minor

Summary. The paper introduces TokenWeave, a system for tensor-parallel distributed LLM inference that fuses AllReduce with RMSNorm into a single kernel. It leverages NVSHARP/Multimem hardware features on Hopper/Blackwell GPUs to perform this fusion using only 2-8 SMs, enabling effective compute-communication overlap even at small token lengths (down to 1024). Evaluations across multiple models report up to 1.28× latency speedup and up to 1.19× throughput improvement over a vLLM-based baseline (vLLM-Multimem in the figures), with some cases outperforming a no-communication model.

Significance. If the empirical results hold under scrutiny, this addresses a real deployment pain point: prior overlap techniques degrade performance at the small per-iteration token counts typical of low-latency serving. The work demonstrates practical use of modern GPU interconnect features (Multimem) for kernel fusion and could influence default configurations in production inference engines. The open-source release strengthens reproducibility.

major comments (2)

[Evaluation] Evaluation section: The central performance claims (1.28× latency, 1.19× throughput at 1024 tokens) are presented without error bars, exact workload parameters (batch sizes, sequence lengths per iteration, model configurations), or baseline hyperparameter settings. This makes it impossible to assess statistical significance or reproduce the results that underpin the claim of succeeding where prior methods fail.
[Section 3] Section 3 (fused kernel description): The key assumption that reserving only 2-8 SMs for the AllReduce-RMSNorm kernel leaves the remaining SMs sufficient for matmuls and other layers without occupancy or launch penalties is load-bearing for the small-token-length claims, yet no SM utilization traces, occupancy counters, or sensitivity sweeps over SM allocation are reported. Without these, it is unclear whether the reported gains could reverse under the exact conditions where prior overlap methods hurt performance.

minor comments (2)

The abstract states results 'across multiple models and workloads' but the evaluation section would benefit from an explicit table mapping each reported speedup to the precise model, tensor-parallel degree, and token count.
Notation for the fused kernel launch configuration (e.g., how the 2-8 SMs are selected and how the remaining SMs are partitioned) should be clarified with a small diagram or pseudocode for readers unfamiliar with Multimem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment point by point below and will revise the paper accordingly to improve reproducibility and strengthen the supporting evidence.

Referee: [Evaluation] Evaluation section: The central performance claims (1.28× latency, 1.19× throughput at 1024 tokens) are presented without error bars, exact workload parameters (batch sizes, sequence lengths per iteration, model configurations), or baseline hyperparameter settings. This makes it impossible to assess statistical significance or reproduce the results that underpin the claim of succeeding where prior methods fail.

Authors: We agree that the current evaluation section would benefit from greater detail to support reproducibility and allow readers to assess statistical significance. In the revised manuscript, we will add error bars from multiple independent runs, provide exact workload parameters including batch sizes and per-iteration sequence lengths, specify model configurations (e.g., layer counts and hidden dimensions), and document the precise hyperparameter settings used for the vLLM, SGLang, and TensorRT-LLM baselines. These additions will directly address the concerns and enable full reproduction of the reported speedups.
Revision: yes
Referee: [Section 3] Section 3 (fused kernel description): The key assumption that reserving only 2-8 SMs for the AllReduce-RMSNorm kernel leaves the remaining SMs sufficient for matmuls and other layers without occupancy or launch penalties is load-bearing for the small-token-length claims, yet no SM utilization traces, occupancy counters, or sensitivity sweeps over SM allocation are reported. Without these, it is unclear whether the reported gains could reverse under the exact conditions where prior overlap methods hurt performance.

Authors: The allocation of only 2-8 SMs to the fused kernel is intentional, as AllReduce and RMSNorm exhibit low arithmetic intensity at small token counts and can be efficiently executed without starving the compute-bound matmul layers. We recognize that empirical validation would strengthen this claim. In the revision, we will add a sensitivity analysis over SM allocations in Section 3 along with occupancy metrics obtained from profiling to demonstrate that the chosen allocation incurs no measurable launch or occupancy penalties under the evaluated conditions.
Revision: yes
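The rebuttal's "low arithmetic intensity" argument can be checked with a back-of-the-envelope roofline estimate. The sketch below is an assumption-heavy illustration rather than anything from the paper: the FLOP count per RMSNorm element (4), the bf16 traffic model (one read and one write per element, weights and caches ignored), and the approximate H100 SXM peak figures (about 989 TFLOP/s dense bf16 and about 3.35 TB/s HBM bandwidth) are all stated assumptions that should be replaced with measured values.

```python
BYTES_PER_BF16 = 2

# Approximate published peak figures for an H100 SXM; assumptions for illustration only.
H100_PEAK_BF16_FLOPS = 989e12
H100_HBM_BYTES_PER_S = 3.35e12


def rmsnorm_intensity(flops_per_element: float = 4.0) -> float:
    """FLOPs per byte for RMSNorm: each element is read once and written once in bf16."""
    bytes_per_element = 2 * BYTES_PER_BF16
    return flops_per_element / bytes_per_element


def gemm_intensity(tokens: int, hidden_in: int, hidden_out: int) -> float:
    """FLOPs per byte for a (tokens x hidden_in) by (hidden_in x hidden_out) bf16 matmul."""
    flops = 2 * tokens * hidden_in * hidden_out
    traffic = BYTES_PER_BF16 * (tokens * hidden_in + hidden_in * hidden_out + tokens * hidden_out)
    return flops / traffic


def ridge_point(peak_flops: float = H100_PEAK_BF16_FLOPS,
                bandwidth: float = H100_HBM_BYTES_PER_S) -> float:
    """Intensity above which a kernel is compute-bound rather than memory-bound."""
    return peak_flops / bandwidth


if __name__ == "__main__":
    print(rmsnorm_intensity())               # about 1 FLOP per byte
    print(gemm_intensity(1024, 8192, 8192))  # several hundred FLOPs per byte
    print(ridge_point())                     # roughly 295 FLOPs per byte
```

Under these assumptions RMSNorm sits far below the ridge point while a 1024-token matmul sits above it, which is consistent with the claim that a bandwidth-bound kernel can run on a few SMs; it does not by itself show that the remaining SMs suffer no occupancy penalty, which is the referee's actual concern.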

Circularity Check

0 steps flagged

No circularity: empirical systems paper with measured results

The paper is a systems/engineering contribution whose central claims rest on runtime measurements of a fused AllReduce-RMSNorm kernel on Hopper GPUs using NVSHARP/Multimem. No mathematical derivation chain, first-principles equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described content. Performance numbers (1.28× latency, 1.19× throughput) are reported from direct benchmarking rather than any reduction to inputs by construction. The work is therefore self-contained against external hardware benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence and correct functioning of NVSHARP/Multimem on current high-end GPUs and on the empirical observation that small SM allocation does not starve the remaining computation.

axioms (1)

domain assumption: Modern GPUs (Hopper, Blackwell) expose NVSHARP/Multimem features that allow joint communication and RMSNorm with only 2-8 SMs.
Invoked when describing the fused kernel implementation and its resource usage.

pith-pipeline@v0.9.0 · 5868 in / 1298 out tokens · 39543 ms · 2026-05-22T14:18:55.280197+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: TokenWeave ... fused AllReduce–RMSNorm kernel ... using only 2–8 streaming multiprocessors (SMs) on an 8×H100 DGX system ... wave-aware, two-way token split

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: up to 1.28× speedup ... for token lengths as small as 1024

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
cs.PL 2026-05 unverdicted novelty 6.0

DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
cs.NI 2026-04 unverdicted novelty 6.0

Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.

DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
cs.DC 2026-05 unverdicted novelty 5.0

DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.

Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
cs.DC 2025-11 unverdicted novelty 4.0

Thermal imbalance in multi-GPU nodes creates hotter straggler GPUs that slow down cooler leader GPUs during overlapped computation and communication in LLM training.

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
cs.DC 2026-04 unverdicted novelty 3.0

A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 6 Pith papers · 4 internal anchors

[1]

S., Bui, T., Kim, S., Chang, W., and Goharian, N

Cohan, A., Dernoncourt, F., Kim, D. S., Bui, T., Kim, S., Chang, W., and Goharian, N. A discourse-aware attention model for abstractive summarization of long documents. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 2 (Short Papers), pp. 615–621, Ne...

work page 2018
[2]

doi: 10.18653/v1/N18-2097

Association for Computa- tional Linguistics. doi: 10.18653/v1/N18-2097. URL https://aclanthology.org/N18-2097. DeepSeek-AI. Profiling data in deepseek infra,

work page doi:10.18653/v1/n18-2097 2097
[3]

The Llama 3 Herd of Models

URL https://github.com/deepseek-ai/profile-data?tab= readme-ov-file#inference. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

ISBN 9781450392051

Association for Computing Machinery. ISBN 9781450392051. doi: 10.1145/3503222.3507778. URL https://doi.org/10.1145/3503222.3507778. Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a. Jiang, C., Tian, Y ...

work page doi:10.1145/3503222.3507778 2001
[5]

Scaling Laws for Neural Language Models

URL https://arxiv.org/abs/ 2001.08361. Kwon, W., Li, Z., Zhuang, S., Sheng, Y ., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp. 611–626,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[6]

GPT-4 Technical Report

URL https://doi. org/10.48550/arXiv.2303.08774. Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri,´I., Maleki, S., and Bianchini, R. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architec- ture (ISCA), pp. 118–132. IEEE,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774
[7]

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference SGLang Team

URL https://github.com/pytorch/pytorch/blob/v2.6.0/torch/ csrc/distributed/c10d/CUDASymmetricMemoryOps.cu. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference SGLang Team. Deploying DeepSeek with PD Disaggre- gation and Large-Scale Expert Parallelism on 96 H100 GPUs. https://lmsys.org/blog/2025-05-05-large-scale- ep/,

work page 2025
[8]

Shi, S., Pan, X., Chu, X., and Li, B

Dataset. Shi, S., Pan, X., Chu, X., and Li, B. Pipemoe: Accelerat- ing mixture-of-experts through adaptive pipelining. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications, pp. 1–10. IEEE,

work page 2023
[9]

Optimization and tuning

vLLM Team. Optimization and tuning. https://docs.vllm.ai/ en/v0.8.5/performance/optimization.html, 2025a. vLLM Team. vllm v1, 2025b. URL https://blog.vllm.ai/ 2025/01/27/v1-alpha-release.html. Wang, S., Wei, J., Sabne, A., Davis, A., Ilbeyi, B., Hecht- man, B., Chen, D., Murthy, K. S., Maggioni, M., Zhang, Q., Kumar, S., Guo, T., Xu, Y ., and Zhou, Z. Ove...

[10]

ISBN 9781450399159

Association for Computing Machinery. ISBN 9781450399159. doi: 10.1145/3567955.3567959. URL https://doi.org/10.1145/3567955.3567959. Wang, Y., He, H., Wright, L., Wehrstedt, L., Liu, T., and Liang, W. Distributed w/ torchtitan: Introducing async tensor parallelism in pytorch,

[11]

Qwen2.5 Technical Report

URL https://dev-discuss.pytorch.org/t/pytorch-symmetricmemory-harnessing-nvlink-programmability-with-ease/2798/1. Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

[12]

ISBN 9798331314385

Curran Associates Inc. ISBN 9798331314385. Zheng, S., Bao, W., Hou, Q., Zheng, X., Fang, J., Huang, C., Li, T., Duanmu, H., Chen, R., Xu, R., Guo, Y., Zheng, N., Jiang, Z., Di, X., Wang, D., Ye, J., Lin, H., Chang, L.-W., Lu, L., Liang, Y., Zhai, J., and Liu, X. Triton-distributed: Programming overlapping kernels on distributed AI systems with the tr...

[13]

B EVALUATION We provide some additional evaluations of TokenWeave in this section

and the RMSNorm kernel from vLLM (vLLM Contributors, 2023). B EVALUATION We provide some additional evaluations of TokenWeave in this section. B.1 Throughput Gains Figure 20 presents TokenWeave’s throughput gains on 4×H100 GPUs for various end-to-end workload traces. Similar to the 8-GPU results discussed in the main paper, TokenWeave consistently imp...

[14]

[Plot residue: latency (ms) versus sequence length (1K to 64K) for Qwen3-235B-A22B (8x H100); series vLLM-Multimem, vLLM-nocomm, and TokenWeave; bar labels 1.05x, 1.05x, 1.06x, 1.11x, 1.15x, 1.15x, 1.12x.]

[15]

TokenWeave throughput gains for end-to-end workload traces. Shown are throughput measurements across fixed (input, output)-length traces, as well as ShareGPT and arXiv traces, for two models on 4×H100. [Plot residue: time (ms), x-axis ticks 512 to 64K, Llama-3.3-70B-Instruct (4x H100); bar labels 1.04x, 1.14x, 1.11x, 1.13x, 1.16x, 1.14x, 1.12x, 1.11x; ...]

[16]

TokenWeave latency gains. Shown are the execution times for prefill requests with varying sequence lengths for different models on 4×H100. [Plot residue: time (ms), x-axis ticks 1 to 64, Llama-3.3-70B-Instruct (8x H100); bar labels 1.05x, 1.21x, 1.19x, 1.23x, 1.26x, 1.29x, 1.31x; second panel bar labels 1.06x, 1.15x, 1.13x, 1.22x, 1.23x, 1.25x, 1.26x; ...]

[17]

TokenWeave latency gains. Shown are the execution times for prefill requests with varying batch sizes and sequence length of 512 for different models on (a) 8×H100 and (b) 4×H100. In almost all cases, TokenWeave is close to or better than the theoretical vLLM-nocomm baseline with zero communication overhead, showing that TokenWeave not only recovers all com...

[18]

TokenWeave Fused AllReduce–RMSNorm kernel ablation. We compare TokenWeave-fuseonly and full TokenWeave against the vLLM-Multimem baseline. Execution times are shown for prefill requests with varying sequence lengths for different models on 4×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memo...

[19]

Results are shown for sequence lengths from 32 to 64K tokens (hidden size 8192, bf16)

Latency of the fused AllReduce–RMSNorm kernel versus SM count on an 8×B200 DGX system. Results are shown for sequence lengths from 32 to 64K tokens (hidden size 8192, bf16). Similar to the 8×H100 results, latency reductions diminish beyond roughly 8–16 SMs. C.2 Decode-Only Batches In the V1 architecture, vLLM uses torch.compile to reduce Python execution o...

[20]

D ARTIFACT D.1 Abstract Distributed inference of LLMs can incur overheads of up to 20%, even when GPUs are connected via high-speed interconnects such as NVLink

TokenWeave prefill latency gains on an 8×B200 DGX system for Llama-3.3-70B (varying sequence lengths). TokenWeave-fuseonly achieves 1.05×–1.07× speedup, while full TokenWeave achieves up to 1.22× over vLLM-Multimem. D ARTIFACT D.1 Abstract Distributed inference of LLMs can incur overheads of up to 20%, even when GPUs are connected via high-speed interconnects suc...

work page doi:10.5281/zenodo.18844243
