arxiv: 2505.11329 · v5 · submitted 2025-05-16 · 💻 cs.DC · cs.LG

TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

Pith reviewed 2026-05-22 14:18 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords distributed LLM inference · tensor parallelism · compute-communication overlap · AllReduce · RMSNorm · GPU kernels · LLM serving

The pith

TokenWeave enables efficient compute-communication overlap for tensor-parallel LLM inference at token lengths as small as 1024.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how fusing AllReduce with RMSNorm into one kernel lets communication overlap with computation in tensor-parallel LLM serving even at low token counts. Existing systems turn overlap off for small workloads because splitting tasks usually adds more overhead than it saves and uses up too many GPU resources. The new kernel runs on modern GPUs with only a handful of streaming multiprocessors by using special hardware features for joint communication and normalization. If this holds, serving systems could deliver faster responses without forcing larger batches or removing parallelism.

Core claim

TokenWeave is the first system to enable efficient compute-communication overlap for tensor-parallel model inference for token lengths as small as 1024. It identifies RMSNorm as a key operation and optimizes it together with communication through a novel fused AllReduce-RMSNorm kernel. This kernel leverages the NVSHARP/Multimem feature on Hopper and Blackwell GPUs to perform both tasks jointly using only 2-8 streaming multiprocessors on an 8xH100 DGX system, delivering up to 1.28x latency speedup and up to 1.19x higher throughput across models and workloads.

What carries the argument

The fused AllReduce-RMSNorm kernel that uses NVSHARP/Multimem to jointly execute communication and RMSNorm while allocating only 2-8 streaming multiprocessors.
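
To make the fusion concrete, here is a minimal CPU-only sketch of what a fused AllReduce-RMSNorm computes, compared with doing the two steps separately. It is a plain-Python simulation and not the paper's CUDA kernel: ranks are modeled as nested lists, no Multimem or NVSwitch features are involved, and the function names, `eps` value, and tensor sizes are all illustrative assumptions.

```python
import math
import random

def rmsnorm_row(row, weight, eps=1e-6):
    """RMSNorm of one token: divide by the root-mean-square, then scale by weight."""
    rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v / rms * w for v, w in zip(row, weight)]

def unfused(partials, weight):
    """Two passes: materialize the full AllReduce result, then normalize it."""
    num_tokens, hidden = len(partials[0]), len(weight)
    reduced = [[sum(p[t][h] for p in partials) for h in range(hidden)]
               for t in range(num_tokens)]
    return [rmsnorm_row(row, weight) for row in reduced]

def fused(partials, weight):
    """One pass: reduce a token across ranks and normalize it right away."""
    out = []
    for t in range(len(partials[0])):
        row = [sum(p[t][h] for p in partials) for h in range(len(weight))]
        out.append(rmsnorm_row(row, weight))
    return out

random.seed(0)
RANKS, TOKENS, HIDDEN = 8, 4, 16  # illustrative sizes only (hypothetical)
# partials[rank][token][hidden]: each rank's partial output before AllReduce
partials = [[[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(TOKENS)]
            for _ in range(RANKS)]
weight = [random.gauss(0, 1) for _ in range(HIDDEN)]

a, b = unfused(partials, weight), fused(partials, weight)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Both paths give the same numbers. The difference the paper targets is where the intermediate lives: the unfused path writes and re-reads the full reduced tensor, while the fused path normalizes each token as soon as it has been reduced.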

If this is right

  • Tensor-parallel serving can keep low token counts per iteration while cutting communication overheads that reach 20 percent on NVLink.
  • Latency drops and throughput rises across multiple models without changing the overall serving setup.
  • In some cases the system outperforms an equivalent model that has all communication removed entirely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fusion idea could be tested on other normalization or activation steps to reduce overhead in additional layers.
  • Systems running on older GPUs might need alternative low-SM methods to reach similar overlap for small token lengths.

Load-bearing premise

The performance gains require NVSHARP or Multimem hardware support on Hopper or Blackwell GPUs and assume that dedicating only a few SMs to the fused kernel leaves enough resources for the rest of the model.

What would settle it

Running the same workloads on GPUs without NVSHARP/Multimem support and checking whether the reported latency and throughput improvements over baseline still appear.

Figures

Figures reproduced from arXiv: 2505.11329 by Nipun Kwatra, Raja Gond, Ramachandran Ramjee.

Figure 1
Figure 1: AllReduce communication and RMSNorm overheads for three models versus sequence length on an 8×H100 DGX system (median across 12 runs; GPU clocks are locked to the TDP frequency (Prescott, 2022)). Despite NVLink and NVSHARP, communication overheads range from 9–23%. RMSNorm overheads are also non-trivial, ranging from 4–9%.
Figure 2
Figure 2: Inference latency of Llama-3.3-70B on an 8×H100 DGX system for various sequence lengths. vLLM-Multimem corresponds to vLLM with an optimized AllReduce implementation using Multimem (NVIDIA, 2023) and NVSHARP (NVIDIA, 2024) support. vLLM-nocomm is a counterfactual baseline corresponding to only the computation time without any communication. The dotted lines show performance normalized to the vLLM-Multimem…
Figure 3
Figure 3: Selective enabling of splitting/overlap in TokenWeave. At each iteration, the number of tokens in the batch (num_tokens) is checked against a token threshold. Full TokenWeave with splitting and overlap is enabled only for higher values of num_tokens. For smaller num_tokens, where splitting can result in higher overheads, we only enable the fused AllReduce–RMSNorm kernel. The method applies uniformly to prefill-only, mixed, a…
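
The per-iteration check described in the Figure 3 caption can be sketched in a few lines. This is an illustration only: the function name, the mode labels, the default threshold of 1024, and the choice of `>=` over `>` are assumptions, not values taken from the TokenWeave source code.

```python
# Hypothetical threshold; the paper reports overlap paying off at token counts
# as small as 1024, but the real value may be tuned per model and GPU.
TOKEN_THRESHOLD = 1024

def choose_mode(num_tokens, token_threshold=TOKEN_THRESHOLD):
    """Pick the execution mode for one iteration based on batch token count."""
    if num_tokens >= token_threshold:
        # Large enough batch: split it in two and overlap compute with communication.
        return "split_overlap"
    # Small batch: splitting would add overhead, so only use the fused kernel.
    return "fused_only"

modes = {n: choose_mode(n) for n in (64, 512, 1024, 8192)}
```

Under these assumptions, a decode-heavy iteration with a few hundred tokens takes the fused-only path, while a prefill chunk of several thousand tokens takes the split-and-overlap path.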
Figure 4
Figure 4: Splitting AllReduce (AR) into ReduceScatter (RS) and AllGather (AG) can result in non-trivial overheads. Shown are the absolute times and the relative performance (line plots) of these operations on an 8×H100 DGX system for varying sequence lengths. All runs are with a hidden size of 8192 using bf16.
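
For readers unfamiliar with the decomposition in Figure 4, the following plain-Python simulation shows that a ReduceScatter followed by an AllGather produces the same result as a single AllReduce. It only models the data movement on the CPU; it says nothing about timing, and the function names and buffer sizes are illustrative assumptions.

```python
def all_reduce(bufs):
    """Every rank ends up with the elementwise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

def reduce_scatter(bufs):
    """Rank r ends up with only the r-th chunk of the elementwise sum."""
    n = len(bufs)
    chunk = len(bufs[0]) // n  # assumes the buffer length divides evenly by n
    return [[sum(b[i] for b in bufs) for i in range(r * chunk, (r + 1) * chunk)]
            for r in range(n)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]

# Four simulated ranks, eight elements each (illustrative values).
bufs = [[rank * 10 + i for i in range(8)] for rank in range(4)]
same_result = all_reduce(bufs) == all_gather(reduce_scatter(bufs))
```

The results match, so the split is functionally free. The cost the caption points to is performance: two smaller collectives replace one larger one.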
Figure 5
Figure 5: Multimem-based AllReduce implementations require very few SMs. Shown is the performance of the Multimem AllReduce kernel with different numbers of SMs for varying sequence lengths (hidden size 8192, bf16). In most cases, 4–8 SMs are enough to achieve near-optimal performance. … service-level objectives (SLOs) of interactive workloads. Moreover, distributed inference is more efficient in many cases, as it al…
Figure 6
Figure 6: Large collective operations are more efficient. Shown is the bandwidth for ReduceScatter (RS) on an 8×H100 DGX system for varying sequence lengths (hidden size 8192, bf16). Larger tensors result in much better bandwidth, demonstrating that splitting input into smaller parts results in overheads.
Figure 7
Figure 7: Overview of TokenWeave: (a) Vanilla Tensor Parallelism: all compute and communication operations are performed sequentially. (b) TokenWeave: the input batch is partitioned into two splits. AllReduce is fused with RMSNorm, and computation of one split overlaps with communication of the other split. Separate compute and communication streams weave to orchestrate the overlap.
Figure 8
Figure 8: Benefits of smart-splitting: Execution of a kernel with ten CTAs on a hypothetical GPU with four SMs. Assuming each CTA exclusively occupies one SM, the kernel executes in multiple waves as SM resources become available. … NanoFlow overlaps GPU compute, HBM traffic, and NVLink transfers. The approach, however, relies on high batch sizes for breaking the input batch into sufficiently sized nano-batches,…
Figure 10
Figure 10: Our fused AllReduce–RMSNorm kernel performs optimally with very few SMs. We show the latency of the kernel under varying numbers of SMs for different sequence lengths (hidden size 8192 on an 8×H100 DGX, bf16). Using 8 SMs is close to optimal in most cases. … containing 132 CTAs (exactly one full wave) and the other split with 168 CTAs (one full wave and one partial wave). This method effectively maintains …
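
The wave arithmetic behind the smart-splitting captions can be worked through with a short sketch. It assumes, as the Figure 8 caption does, that each CTA exclusively occupies one SM, and it assumes all 132 SMs of an H100 are available to the kernel. The total of 300 CTAs is inferred from the 132 + 168 split quoted above, and the 150/150 "equal split" is a hypothetical comparison point rather than a number from the paper.

```python
import math

def num_waves(num_ctas, num_sms):
    """Waves needed when each CTA exclusively occupies one SM."""
    return math.ceil(num_ctas / num_sms)

# Figure 8 example: ten CTAs on a hypothetical GPU with four SMs.
fig8_waves = num_waves(10, 4)

H100_SMS = 132  # SM count of an H100 SXM GPU

# Hypothetical equal split of 300 CTAs: both halves spill into a second wave.
equal_split_waves = num_waves(150, H100_SMS) + num_waves(150, H100_SMS)

# Split quoted in the text: 132 CTAs (one full wave) and 168 CTAs (two waves).
smart_split_waves = num_waves(132, H100_SMS) + num_waves(168, H100_SMS)
```

Under these assumptions the equal split needs four waves in total while the 132/168 split needs three, which matches the unsplit kernel (300 CTAs on 132 SMs also take three waves).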
Figure 11
Figure 11: TokenWeave throughput gains for end-to-end workload traces. We show the measured throughput for ShareGPT, arXiv, as well as fixed (input, output)-length traces for different models on 8×H100 DGX.
Figure 12
Figure 12: TokenWeave throughput gains for end-to-end traces under chunk size variation. We show the measured throughput for ShareGPT, arXiv, as well as fixed (input, output)-length traces for Llama-3.3-70B on 8×H100 DGX. Chunk size varied from 1024–8192. … We evaluate TokenWeave across a range of popular models and workload settings, and under multiple tensor parallelism (TP) configurations.…
Figure 13
Figure 13: TokenWeave latency gains. Execution times of prefill requests with varying sequence lengths for different models on 8×H100. TokenWeave is close to or better than the theoretical vLLM-nocomm baseline with zero communication overhead, showing that TokenWeave not only recovers all communication overhead, but provides additional gains due to RMSNorm optimization.
Figure 14
Figure 14: Single-layer latency for Llama-3.3-70B on an 8×H100 DGX system. Numbers at the top represent normalized performance compared to vLLM-Multimem. While TileLink ends up with an overhead at small sequence lengths, TokenWeave consistently provides high gains over the entire sequence length range. … Baselines: We first compare against the vLLM 0.8.5 implem…
Figure 15
Figure 15: NanoFlow throughput for end-to-end workload traces under fixed (input, output)-length traces for Llama-3.3-70B on an 8×H100 DGX. nanoflow-full corresponds to full NanoFlow, while nanoflow-frameworkonly disables NanoFlow (both nanobatching and overlap) but uses their custom serving framework. … sizes enabling lower TBT values but also resulting in lower throughput. In the second experiment, we vary the chunk…
Figure 16
Figure 16: We compare TokenWeave-fuseonly and full TokenWeave against the vLLM-Multimem baseline. Execution times are shown for prefill requests with varying sequence lengths for different models on 8×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memory accesses, while TokenWeave provides additional gains from the compute-communication overlap. …
Figure 17
Figure 17: Ablation results. TokenWeave-fuseonly uses only the fused kernel. TokenWeave-equalsplit enables token splitting and overlap but does not apply smart-splitting. Experiments are conducted on an 8×H100 DGX system. … We evaluate the adapted NanoFlow implementation under two conditions: with NanoFlow enabled and disabled within its own serving stack.
Figure 18
Figure 18: Implementation of our fused AllReduce–RMSNorm CUDA kernel. Each CTA processes a subset of tokens, performs inter-GPU reduction using the multimem ld reduce add primitive, computes the local variance, and applies normalization before storing results via multimem st. By eliminating intermediate memory accesses and offloading reduction to NVSwitch, this fused implementation achieves lower HBM co…
Figure 19
Figure 19: Shown are the execution times for prefill requests with varying sequence lengths for Qwen3-235B-A22B on an 8×H100 DGX system. … the results stay similar to the case of fixed batch size and varying sequence lengths …
Figure 20
Figure 20: TokenWeave throughput gains for end-to-end workload traces. Shown are throughput measurements across fixed (input, output)-length traces, as well as ShareGPT and arXiv traces, for two models on 4×H100.
Figure 21
Figure 21: TokenWeave latency gains. Shown are the execution times for prefill requests with varying sequence lengths for different models on 4×H100.
Figure 22
Figure 22: TokenWeave latency gains. Shown are the execution times for prefill requests with varying batch sizes and sequence length of 512 for different models on (a) 8×H100 and (b) 4×H100. In almost all cases, TokenWeave is close to or better than the theoretical vLLM-nocomm baseline with zero communication overhead, showing that TokenWeave not only recovers all communication overhead but also provides additional …
Figure 23
Figure 23: TokenWeave Fused AllReduce–RMSNorm kernel ablation. We compare TokenWeave-fuseonly and full TokenWeave against the vLLM-Multimem baseline. Execution times are shown for prefill requests with varying sequence lengths for different models on 4×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memory accesses, while TokenWeave provides addit…
Figure 24
Figure 24: TokenWeave Fused AllReduce–RMSNorm kernel ablation. Shown are the execution times of prefill requests with varying batch sizes, with the sequence length fixed at 512, for different models on (a) 8×H100 and (b) 4×H100. TokenWeave-fuseonly provides gains due to the elimination of redundancy in RMSNorm computation and intermediate memory accesses, while TokenWeave provides additional gains through compute-co…
Figure 25
Figure 26
Figure 26: TokenWeave Decode Latency Gains on an 8×B200 DGX system for Llama-3.3-70B (context lengths 2K and 4K) and Qwen3-235B-A22B (context length 2K).
Figure 27
Figure 27: TokenWeave Prefill latency gains on an 8×B200 DGX system for Llama-3.3-70B (varying sequence lengths). TokenWeave-fuseonly achieves 1.05×–1.07× speedup, while full TokenWeave achieves up to 1.22× over vLLM-Multimem. … Distributed inference of LLMs can incur overheads of up to 20%, even when GPUs are connected via high-speed interconnects such as NVLink. Additionally, RMSNorm and resi…
Original abstract

Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these subtasks. However, none of these techniques are turned on by default during tensor-parallel serving in systems like vLLM, SGLang and TensorRT-LLM. This is because the number of tokens processed per iteration is typically kept small to support low-latency serving, and decomposing such smaller workloads to enable communication overlap results in worse performance. Further, the communication itself uses many streaming multiprocessors (SMs) that would otherwise be available for computation, increasing overhead. We present TokenWeave, the first system to enable efficient compute-communication overlap for tensor-parallel model inference for token lengths as small as 1024. TokenWeave identifies RMSNorm, a previously overlooked operation, as crucial and optimizes it along with communication by implementing a novel fused AllReduce–RMSNorm kernel. Further, this kernel leverages the NVSHARP/Multimem feature available on modern GPUs (e.g., Hopper, Blackwell) to jointly perform communication and RMSNorm efficiently using only 2–8 streaming multiprocessors (SMs) on an 8×H100 DGX system. Our evaluations demonstrate up to 1.28× speedup in latency (baseline ÷ ours) and up to 1.19× higher throughput (ours ÷ baseline) across multiple models and workloads. In several settings, TokenWeave delivers better performance than an equivalent model with all communication removed. The source code is available at https://github.com/microsoft/tokenweave.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TokenWeave, a system for tensor-parallel distributed LLM inference that fuses AllReduce with RMSNorm into a single kernel. It leverages NVSHARP/Multimem hardware features on Hopper/Blackwell GPUs to perform this fusion using only 2-8 SMs, enabling effective compute-communication overlap even at small token lengths (down to 1024). Evaluations across multiple models report up to 1.28× latency speedup and up to 1.19× higher throughput over a vLLM baseline with Multimem-based AllReduce (vLLM-Multimem), with some cases outperforming a no-communication model.

Significance. If the empirical results hold under scrutiny, this addresses a real deployment pain point: prior overlap techniques degrade performance at the small per-iteration token counts typical of low-latency serving. The work demonstrates practical use of modern GPU interconnect features (Multimem) for kernel fusion and could influence default configurations in production inference engines. The open-source release strengthens reproducibility.

major comments (2)
  1. [Evaluation] Evaluation section: The central performance claims (1.28× latency, 1.19× throughput at 1024 tokens) are presented without error bars, exact workload parameters (batch sizes, sequence lengths per iteration, model configurations), or baseline hyperparameter settings. This makes it impossible to assess statistical significance or reproduce the results that underpin the claim of succeeding where prior methods fail.
  2. [Section 3] Section 3 (fused kernel description): The key assumption that reserving only 2-8 SMs for the AllReduce-RMSNorm kernel leaves the remaining SMs sufficient for matmuls and other layers without occupancy or launch penalties is load-bearing for the small-token-length claims, yet no SM utilization traces, occupancy counters, or sensitivity sweeps over SM allocation are reported. Without these, it is unclear whether the reported gains could reverse under the exact conditions where prior overlap methods hurt performance.
minor comments (2)
  1. The abstract states results 'across multiple models and workloads' but the evaluation section would benefit from an explicit table mapping each reported speedup to the precise model, tensor-parallel degree, and token count.
  2. Notation for the fused kernel launch configuration (e.g., how the 2-8 SMs are selected and how the remaining SMs are partitioned) should be clarified with a small diagram or pseudocode for readers unfamiliar with Multimem.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each major comment point by point below and will revise the paper accordingly to improve reproducibility and strengthen the supporting evidence.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central performance claims (1.28× latency, 1.19× throughput at 1024 tokens) are presented without error bars, exact workload parameters (batch sizes, sequence lengths per iteration, model configurations), or baseline hyperparameter settings. This makes it impossible to assess statistical significance or reproduce the results that underpin the claim of succeeding where prior methods fail.

    Authors: We agree that the current evaluation section would benefit from greater detail to support reproducibility and allow readers to assess statistical significance. In the revised manuscript, we will add error bars from multiple independent runs, provide exact workload parameters including batch sizes and per-iteration sequence lengths, specify model configurations (e.g., layer counts and hidden dimensions), and document the precise hyperparameter settings used for the baselines. These additions will directly address the concerns and enable full reproduction of the reported speedups. revision: yes

  2. Referee: [Section 3] Section 3 (fused kernel description): The key assumption that reserving only 2-8 SMs for the AllReduce-RMSNorm kernel leaves the remaining SMs sufficient for matmuls and other layers without occupancy or launch penalties is load-bearing for the small-token-length claims, yet no SM utilization traces, occupancy counters, or sensitivity sweeps over SM allocation are reported. Without these, it is unclear whether the reported gains could reverse under the exact conditions where prior overlap methods hurt performance.

    Authors: The allocation of only 2-8 SMs to the fused kernel is intentional, as AllReduce and RMSNorm exhibit low arithmetic intensity at small token counts and can be efficiently executed without starving the compute-bound matmul layers. We recognize that empirical validation would strengthen this claim. In the revision, we will add a sensitivity analysis over SM allocations in Section 3 along with occupancy metrics obtained from profiling to demonstrate that the chosen allocation incurs no measurable launch or occupancy penalties under the evaluated conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with measured results

full rationale

The paper is a systems/engineering contribution whose central claims rest on runtime measurements of a fused AllReduce-RMSNorm kernel on Hopper GPUs using NVSHARP/Multimem. No mathematical derivation chain, first-principles equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or described content. Performance numbers (1.28× latency, 1.19× throughput) are reported from direct benchmarking rather than any reduction to inputs by construction. The work is therefore self-contained against external hardware benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence and correct functioning of NVSHARP/Multimem on current high-end GPUs and on the empirical observation that small SM allocation does not starve the remaining computation.

axioms (1)
  • domain assumption: Modern GPUs (Hopper, Blackwell) expose NVSHARP/Multimem features that allow joint communication and RMSNorm with only 2-8 SMs.
    Invoked when describing the fused kernel implementation and its resource usage.

pith-pipeline@v0.9.0 · 5868 in / 1298 out tokens · 39543 ms · 2026-05-22T14:18:55.280197+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  2. DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

    cs.PL 2026-05 unverdicted novelty 6.0

    DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

  3. Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving

    cs.NI 2026-04 unverdicted novelty 6.0

    Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.

  4. DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

    cs.DC 2026-05 unverdicted novelty 5.0

    DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.

  5. Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

    cs.DC 2025-11 unverdicted novelty 4.0

    Thermal imbalance in multi-GPU nodes creates hotter straggler GPUs that slow down cooler leader GPUs during overlapped computation and communication in LLM training.

  6. Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

    cs.DC 2026-04 unverdicted novelty 3.0

    A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
