pith. machine review for the scientific record.

arxiv: 2604.17172 · v2 · submitted 2026-04-19 · 💻 cs.DC · cs.AI

Recognition: unknown

UCCL-Zip: Lossless Compression Supercharged GPU Communication

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords lossless compression · GPU communication · point-to-point · NCCL · collective communication · LLM training · reinforcement learning · distributed inference

The pith

Lossless compression, fused into GPU communication kernels, cuts synchronization time without numerical error or API changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that lossless compression can be embedded directly inside existing GPU point-to-point and collective communication paths. It does so by redesigning the data movement pipelines and kernel execution models so that compression and transmission overlap on large blocks while preserving exact numerical values. A sympathetic reader would care because GPU communication now dominates runtime in large-model training and serving, and earlier compression methods either risked accuracy loss or forced changes to application code. If the approach holds, distributed workloads can send less data per transfer and finish faster while remaining fully compatible with current libraries and programs.

Core claim

UCCL-Zip integrates lossless compression directly into GPU communication primitives. For point-to-point communication it uses a split-send pipeline that exposes transmissible data early and overlaps compression with communication while operating on large data blocks. For collective communication it fuses compression into NCCL's persistent kernel model, eliminating redundant memory traffic and kernel launches. The design supports both patterns without modifying user-facing APIs and without compromising numerical correctness.
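
To make the collective path concrete, here is a minimal sketch of the fused pipeline's data flow, following the description that accompanies Figure 6. The callables are hypothetical placeholders, not the authors' kernel API; in UCCL-Zip these stages run fused inside a single persistent kernel rather than as separate host-side calls.

```python
# Conceptual data flow of ring all_reduce with compression stages inserted
# (per-slice receive-reduce-send, as described alongside Figure 6).
# compress/decompress/send/recv/reduce are assumed callables, not real NCCL API.

def ring_all_reduce_with_compression(slices, compress, decompress,
                                     send, recv, reduce):
    for i, s in enumerate(slices):
        incoming = decompress(recv())   # wire format is compressed
        merged = reduce(s, incoming)    # reduction sees exact decompressed values
        send(compress(merged))          # forward the partial sum in compressed form
        slices[i] = merged
    return slices
```

Because the compression is lossless, the reduction receives bit-identical inputs to the uncompressed pipeline, which is what preserves numerical correctness.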

What carries the argument

The split-send pipeline for P2P and the fused compression step inside NCCL persistent kernels, which together allow compression to run concurrently with data transfer and remove extra memory operations.
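
One way to read the split-send design (a sketch under stated assumptions, not the paper's implementation): the transmissible portion of a tensor goes out immediately while the remainder is compressed as one large block on a second CUDA stream, then follows in compressed form. Here `gpu_compress` is a hypothetical stand-in for a GPU lossless compressor such as the ANS kernels the paper builds on.

```python
import torch
import torch.distributed as dist

def split_send(tensor: torch.Tensor, dst: int, gpu_compress, split: float = 0.5):
    """Sketch of split-send: overlap large-block compression with transmission."""
    flat = tensor.reshape(-1)
    n = int(flat.numel() * split)
    head, tail = flat[:n], flat[n:]

    comp_stream = torch.cuda.Stream()
    with torch.cuda.stream(comp_stream):
        tail_z = gpu_compress(tail)      # compress the remainder as one large block

    req = dist.isend(head, dst)          # head is transmissible early: send it now
    torch.cuda.current_stream().wait_stream(comp_stream)
    req.wait()
    dist.send(tail_z, dst)               # then ship the compressed remainder
```

Operating on one large tail block, rather than many small chunks, matches the paper's observation that GPU compression latency does not shrink proportionally with chunk size (Figure 4).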

If this is right

  • RL weight synchronization accelerates by up to 47.5 percent.
  • vLLM end-to-end inference latency drops by up to 10 percent.
  • No changes are required to application source code or APIs.
  • Both point-to-point and collective patterns remain fully supported.
  • Numerical results stay identical to the uncompressed case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion pattern could be applied to additional collective operations such as all-reduce variants if the kernel integration cost remains low.
  • Shorter synchronization times might enable more frequent model updates in distributed reinforcement learning without raising total wall-clock time.
  • Lower communication duration could reduce the fraction of time GPUs spend idle, improving overall cluster throughput in multi-tenant environments.
  • If the overhead stays favorable on newer interconnects, communication libraries might adopt fused lossless compression as a default option.

Load-bearing premise

The time cost of performing lossless compression and decompression on the GPU stays small enough relative to the bandwidth saved that overall communication finishes faster across real workloads.
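
A back-of-envelope model makes the premise checkable. Only the 64% compression ratio below comes from the paper (Figure 8); the other numbers are assumed for illustration.

```python
# Net gain of compressed vs. plain transfer (illustrative model): compression
# pays off only when the wire-time saved exceeds the compression latency left
# exposed after overlap with communication.

def net_gain_ms(volume_gb, link_gb_per_s, ratio, comp_ms, overlap_frac):
    baseline_ms = volume_gb / link_gb_per_s * 1000.0  # uncompressed transfer time
    wire_ms = baseline_ms * ratio                     # compressed payload on the wire
    exposed_ms = comp_ms * (1.0 - overlap_frac)       # compression cost not hidden
    return baseline_ms - (wire_ms + exposed_ms)       # > 0 means a net win

# 1 GB over an assumed 50 GB/s link, 64% ratio, 3 ms compression, 80% overlapped:
print(net_gain_ms(1.0, 50.0, 0.64, 3.0, 0.8))         # ~6.6 ms saved per transfer
```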

What would settle it

A direct timing measurement on an LLM workload comparing the added compression-plus-decompression latency against the reduction in network transfer time for the observed data volumes and link speeds; the premise fails wherever the added latency exceeds the savings.

Figures

Figures reproduced from arXiv: 2604.17172 by Chon Lam Lao, Delong Meng, Ion Stoica, Jia Zhen, Jun Wu, Shuang Ma, Yang Zhou, Yida Wang, Zhiying Xu, Zhuang Wang, Ziming Mao.

Figure 1. Overview of UCCL-Zip. (a) A naive design suffers from lack of overlap and additional kernel overhead. (b) Uzip-P2P enables early transmission and overlaps compression with communication.

Figure 2. Overview and time breakdown of a typical GPU-based floating-point ANS compression pipeline (S1–S3 denote Steps 1–3). The pipeline separates the exponent field from the remaining bits (sign and fraction) and compresses only the exponent field, improving compression efficiency over directly compressing raw floating-point representations.

Figure 4. Comparison of pipelining designs for overlapping compression and communication. (b)(c) Chunk-based pipelining assumes latency scales with size but is ineffective for GPU compression. (d) Split-send exposes transmissible data early and overlaps communication with the remaining compression, reducing communication stall time.

Figure 5. Localized frequency tables eliminate global coordination and enable a fully fused compression pipeline. (a) A global frequency table requires cross-CTA synchronization, preventing kernel fusion and introducing additional memory passes. (b) With localized tables, each CTA independently samples and constructs its own table, enabling a fused pipeline within a single kernel without synchronization.

Figure 6. Workflow of NCCL all_reduce with compression. In the original NCCL pipeline, each GPU performs a sequence of receive–reduce–send steps for every data slice; the extended pipeline inserts compression and decompression stages while preserving the streaming execution model, so data is compressed at the sender and transmitted in compressed form.

Figure 7. Throughput comparison for bfloat16 peer-to-peer communication across tensor sizes.

Figure 8. Throughput of Uzip-NCCL collective communication primitives (all_to_all and ring all_reduce) across varying sizes, approaching the theoretical upper bound of 73.8 GB/s derived from Amdahl's Law under a 64% compression ratio. For smaller tensors the benefits are more modest (e.g., 8% at 16 MB and 24% at 32 MB), as compression overhead partially offsets bandwidth savings.

Figure 9. Throughput of two-shot all_reduce implemented with asynchronous isend/irecv on two p5en.48xlarge instances (16 GPUs total) with NVLink disabled. Two-shot all_reduce consists of a reduce-scatter phase and an all-gather phase; the all-gather involves only data movement and shows similar compression behavior across implementations.

Figure 10. Application-level evaluation of Uzip-P2P on bf16 weight tensors during RL training: (a) the dense GLM4-9B (9B parameters) and (b) the mixture-of-experts Qwen3.5-35B-A3B (35B parameters). The training pipeline runs on 8 GPUs, with 4 performing policy optimization and 4 generating rollouts.

Figure 11. KV cache transfer latency under prefill–decode disaggregation (P1D3) in vLLM for the Qwen-7B-Chat model.

Figure 12. Communication throughput for different versions of the gate_up_proj weight tensor (214 MB) during GLM4-9B RL training.

Figure 13. Performance of Uzip-P2P across different floating-point data types (bfloat16, float16, float32, float8_e4m3fn, float8_e5m2): throughput and compression ratio across tensor sizes.

Figure 17. Uzip-P2P throughput on AMD MI355X over RoCEv2.
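
Figure 2's pipeline separates each float's exponent field from its sign and fraction bits and entropy-codes only the exponents. A minimal sketch of that split for bfloat16 bit patterns (NumPy on raw uint16 views; the ANS coder itself is omitted) shows why recombination is exact:

```python
import numpy as np

# bfloat16 layout: 1 sign bit, 8 exponent bits, 7 fraction bits.
def split_bf16(bits: np.ndarray):
    exponent = ((bits >> 7) & 0xFF).astype(np.uint8)  # low-entropy plane: compress
    rest = bits & np.uint16(0x807F)                   # sign + fraction: send raw
    return exponent, rest

def merge_bf16(exponent: np.ndarray, rest: np.ndarray) -> np.ndarray:
    return rest | (exponent.astype(np.uint16) << np.uint16(7))

# Stand-in for the uint16 view of a bfloat16 tensor.
raw = np.random.randint(0, 1 << 16, size=8, dtype=np.uint16)
e, r = split_bf16(raw)
assert np.array_equal(merge_bf16(e, r), raw)          # exact round-trip: lossless
```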
Original abstract

The rapid growth of large language models (LLMs) has made GPU communication a critical bottleneck. While prior work reduces communication volume via quantization or lossy compression, these approaches introduce numerical errors that can degrade convergence, accuracy, and stability. We present UCCL-Zip, a unified design that integrates lossless compression directly into GPU communication primitives. UCCL-Zip supports both point-to-point (P2P) and collective communication without modifying user-facing APIs or compromising numerical correctness. For P2P communication, Uzip-P2P employs a split-send pipeline that exposes transmissible data early and overlaps compression with communication, while preserving high GPU efficiency by operating on large data blocks. For collective communication, Uzip-NCCL integrates compression into NCCL's persistent kernel model via fused execution, eliminating redundant memory traffic and kernel launches. In real workloads, UCCL-Zip accelerates RL weight synchronization by up to 47.5% and reduces vLLM end-to-end inference latency by up to 10%, all without application changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents UCCL-Zip, a system that integrates lossless compression directly into GPU communication primitives for both point-to-point (P2P) and collective operations. For P2P, it uses a split-send pipeline to overlap compression with communication; for collectives, it fuses compression into NCCL persistent kernels to eliminate redundant traffic and launches. The work claims up to 47.5% speedup in RL weight synchronization and up to 10% reduction in vLLM end-to-end inference latency, all without application changes or numerical errors.

Significance. If the empirical claims hold under scrutiny, UCCL-Zip provides a practical lossless alternative to quantization for reducing communication volume in distributed LLM training and inference. This could meaningfully alleviate bandwidth bottlenecks in large-scale GPU clusters while preserving correctness, with the fused-kernel approach representing a potentially reusable technique for other communication libraries.

major comments (2)
  1. [Abstract] The central claim that Uzip-NCCL achieves net gains by fusing compression into NCCL persistent kernels (eliminating redundant memory traffic and kernel launches) is load-bearing for the reported speedups, yet the abstract supplies no microbenchmark isolating compression compute latency, temporary buffer overhead, or intra-kernel synchronization costs against bandwidth savings on the exact tensor shapes used in the RL and vLLM experiments.
  2. [Abstract] The reported 47.5% RL synchronization and 10% vLLM latency improvements are presented without any description of baselines, workload tensor dimensions, number of runs, error bars, or breakdown of compression overhead relative to pure communication time; this makes it impossible to determine whether the gains are robust or specific to communication-dominant regimes.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the specific lossless compression algorithm (e.g., LZ4, Zstd, or custom) and its block size to allow readers to assess GPU efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional context on microbenchmarks and experimental details would improve clarity and will revise the abstract to incorporate key points while preserving conciseness. We address each comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that Uzip-NCCL achieves net gains by fusing compression into NCCL persistent kernels (eliminating redundant memory traffic and kernel launches) is load-bearing for the reported speedups, yet the abstract supplies no microbenchmark isolating compression compute latency, temporary buffer overhead, or intra-kernel synchronization costs against bandwidth savings on the exact tensor shapes used in the RL and vLLM experiments.

    Authors: The full manuscript includes microbenchmarks that isolate compression compute latency, temporary buffer overhead, and intra-kernel synchronization costs against bandwidth savings for the exact tensor shapes and sizes appearing in the RL and vLLM workloads. These measurements confirm that the fused-kernel design yields net gains because bandwidth reduction outweighs the added compute and synchronization costs in the relevant regimes. We will revise the abstract to briefly reference these isolation results and the observed net positive impact. revision: yes

  2. Referee: [Abstract] The reported 47.5% RL synchronization and 10% vLLM latency improvements are presented without any description of baselines, workload tensor dimensions, number of runs, error bars, or breakdown of compression overhead relative to pure communication time; this makes it impossible to determine whether the gains are robust or specific to communication-dominant regimes.

    Authors: The manuscript details the baselines (standard NCCL without compression), workload tensor dimensions, number of runs with error bars, and overhead breakdowns relative to pure communication time in the evaluation sections. The reported speedups are shown to be robust in communication-dominant regimes across the tested configurations. We will update the abstract to include a concise statement of the experimental conditions and direct readers to the evaluation for the full breakdown. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system measurements with no derivation chain

Full rationale

The paper describes a systems design (Uzip-P2P split-send pipeline and Uzip-NCCL fused persistent kernels) and supports its claims exclusively through empirical timing measurements on RL weight sync and vLLM workloads. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. Performance numbers (47.5% and 10%) are reported as direct observations rather than predictions derived from the design by construction. The central assumption about fusion overhead is tested experimentally, not presupposed mathematically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a systems-engineering paper whose central claims rest on standard assumptions about GPU hardware interfaces and the feasibility of kernel fusion rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: Existing GPU communication libraries (NCCL, P2P primitives) can be extended with compression kernels without breaking compatibility or introducing prohibitive overhead.
    Invoked when describing Uzip-P2P and Uzip-NCCL integration.

pith-pipeline@v0.9.0 · 5506 in / 1200 out tokens · 53852 ms · 2026-05-10T06:31:26.765075+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 39 canonical work pages · 7 internal anchors

  1. [1]

    Dan Alistarh, Demjan Grubic, Jerry Li, et al . 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encod- ing. InAdvances in Neural Information Processing Systems (NeurIPS ’17)

  2. [2]

    AMD. 2024. RCCL: AMD ROCm Collective Communication Library. https://github.com/ROCmSoftwarePlatform/rccl. Accessed: 2026

  3. [3]

    Noushin Azami, Alex Fallin, and Martin Burtscher. 2025. Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs. InProceedings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS ’25). doi:10.1145/3669940.3707280

  4. [4]

    Jinze Bai, Shuai Bai, Yunfei Chu, et al. 2023. Qwen Technical Report. arXiv:2309.16609 [cs.CL]

  5. [5]

    Jiamin Cao, Yu Guan, Kun Qian, et al. 2024. Crux: GPU-Efficient Com- munication Scheduling for Deep Learning Training. InProceedings of the ACM SIGCOMM 2024 Conference (ACM SIGCOMM ’24). Association for Computing Machinery, 1–15. doi:10.1145/3651890.3672239

  6. [6]

    Li-Wen Chang, Wenlei Bao, Qi Hou, et al. 2024. FLUX: Fast Software- based Communication Overlap On GPUs Through Kernel Fusion. arXiv:2406.06858 [cs.LG]

  7. [7]

    Chuyan Chen, Yutong He, Pengrui Li, et al. 2025. Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees. arXiv:2507.08784 [cs.LG]

  8. [8]

    Chen-Chun Chen, Yu-Min Chou, and Jerry Chou. 2023. PHY: A Performance-Driven Hybrid Communication Compression Method for Distributed Training.J. Parallel and Distrib. Comput.180 (2023), 104719. doi:10.1016/j.jpdc.2023.104719

  9. [9]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...

  10. [10]

    PaLM: Scaling Language Modeling with Pathways.J. Mach. Learn. Res.24, 1, Article 240 (2023), 113 pages

  11. [11]

    Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Zhou Yangjie, Shixuan Sun, and Minyi Guo

  12. [12]

    arXiv:2504.14489v1 [cs.OS]

    Optimizing SLO-oriented LLM Serving with PD-Multiplexing. arXiv:2504.14489v1 [cs.OS]

  13. [13]

    DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]

  14. [14]

    Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, and Xiaowen Chu. 2026. ZipServ: Fast and Memory- Efficient LLM Inference with Hardware-Aware Lossless Compression. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’26). Association...

  15. [15]

    Sahu, et al

    Jiawei Fei, Chen-Yu Ho, Atal N. Sahu, et al. 2021. Efficient Sparse Col- lective Communication and Its Application to Accelerate Distributed Deep Learning. InProceedings of the 2021 ACM SIGCOMM Conference (SIGCOMM ’21). 676–691. doi:10.1145/3452296.3472904

  16. [16]

    Tianxiang Gao, Xiaokai Huo, Hailiang Liu, and Hongyang Gao. 2023. Wide neural networks as Gaussian processes: lessons from deep equi- librium models. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2397, 34 pages

  17. [17]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]

  18. [18]

    Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantiza- tion and Huffman Coding. arXiv:1510.00149 [cs.CV]

  19. [19]

    Yongchang Hao, Yanshuai Cao, and Lili Mou. 2024. NeuZip: Memory- Efficient Training and Inference with Dynamic Compression of Neural Networks. arXiv:2410.20650 [cs.LG]

  20. [20]

    Horace He and Thinking Machines Lab. 2025. Defeating Nondetermin- ism in LLM Inference.https://thinkingmachines.ai/blog/defeating- nondeterminism-in-llm-inference/. Thinking Machines Lab: Connec- tionism

  21. [21]

    Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmon- sky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, Swami- nathan Sundararaman, and Danny Harnik. 2025. ZipNN: Lossless Compression for AI Models. In2025 IEEE 18th International Conference on Cloud Computing (CLOUD). 186–198. doi:10.1109/CLOUD67622.2 025.00028

  22. [22]

    Zhiyi Hu, Siyuan Shen, Tommaso Bonato, et al. 2025. Demystifying NCCL: An In-Depth Analysis of GPU Communication Protocols and Algorithms. In2025 IEEE Symposium on High-Performance Intercon- nects (HOTI). 48–59. doi:10.1109/HOTI66940.2025.00024

  23. [23]

    Jiajun Huang, Sheng Di, Yafan Huang, et al . 2025. GhZCCL: Ad- vancing GPU-aware Collective Communications with Homomorphic Compression. InProceedings of the 39th ACM International Conference on Supercomputing (ICS ’25). 43–56. doi:10.1145/3721145.3733642

  24. [24]

    Jiajun Huang, Sheng Di, Xiaodong Yu, et al . 2024. gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. InProceedings of the 38th ACM International Conference on Supercomputing (ICS ’24). 437–448. doi:10.1145/3650200.3656636

  25. [25]

    Jiajun Huang, Sheng Di, Xiaodong Yu, et al . 2024. An Optimized Error-Controlled MPI Collective Framework Integrated with Lossy Compression. In2024 IEEE International Parallel and Distributed Pro- cessing Symposium (IPDPS). 752–764

  26. [26]

    Jiajun Huang, Sheng Di, Xiaodong Yu, et al. 2025. ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression. arXiv:2502.18554 [cs.DC]

  27. [27]

    Hoskins, Matthew W

    Siyuan Huang, Brian D. Hoskins, Matthew W. Daniels, et al . 2023. Low-Rank Gradient Descent for Memory-Efficient Training of Deep In-Memory Arrays.Journal of Emerging Technologies in Computing Systems19, 2, Article 16 (2023), 24 pages. doi:10.1145/3577214

  28. [28]

    Samirasadat Jamalidinan and Kazem Cheshmi. 2025. Floating- Point Data Transformation for Lossless Compression. 12 arXiv:2506.18062 [cs.DB]

  29. [29]

    Sylvain Jeaugey. 2017. NCCL: Optimized Primitives for Collective Multi-GPU Communication.https://developer.nvidia.com/nccl

  30. [30]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al . 2023. Efficient Memory Management for Large Language Model Serving with Page- dAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. doi:10.1145/3600006.3613165

  31. [31]

    Alexander Langer, Samuel Howell, Sreeram Potluri, et al. 2021. Dy- namic Symmetric Heap Allocation in NVSHMEM. InOpenSHMEM and Related Technologies (Lecture Notes in Computer Science). Springer, 187–198. doi:10.1007/978-3-031-04888-3_12

  32. [32]

    Minghao Li, Ran Ben Basat, Shay Vargaftik, et al . 2024. THC: Ac- celerating Distributed Deep Learning Using Tensor Homomorphic Compression. arXiv:2302.08545 [cs.LG]

  33. [33]

    Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP ’23). Association for Computing Machinery, 766–775. doi:10.114 5/3605573.3605613

  34. [34]

    Xue Li, Cheng Guo, Kun Qian, et al . 2024. Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training. InProceed- ings of the ACM Symposium on Cloud Computing (SoCC ’24). 977–994. doi:10.1145/3698038.3698541

  35. [35]

    Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen

  36. [36]

    fabric-lib: RDMA Point-to-Point Communication for LLM Systems

    RDMA Point-to-Point Communication for LLM Systems. arXiv:2510.27656 [cs.DC]

  37. [37]

    Meta AI Research. 2026. DietGPU.https://github.com/facebookresea rch/dietgpu. GitHub repository, accessed 2026-03-07

  38. [38]

    Mooncake Project. 2024. Mooncake Transfer Engine.https://github.c om/kvcache-ai/Mooncake. Accessed: 2026

  39. [39]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. InProceedings of the Interna- tional Conference for Hi...

  40. [40]

    NVIDIA. 2023. nvCOMP: NVIDIA GPU Data Compression Library. https://github.com/NVIDIA/nvcomp. Accessed: July 31, 2023

  41. [41]

    NVIDIA. 2023. NVIDIA CUDA C Programming Guide.https://docs.n vidia.com/cuda/cuda-c-programming-guide/

  42. [42]

    NVIDIA. 2025. NIXL: NVIDIA Inference Xfer Library.https://github .com/ai-dynamo/nixl

  43. [43]

    Qwen, An Yang, Baosong Yang, et al. 2025. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL]

  44. [44]

    Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2024. CASSINI: Network-Aware Job Scheduling in Machine Learning Clus- ters. InProceedings of the 21st USENIX Symposium on Networked Sys- tems Design and Implementation (NSDI ’24). USENIX Association, Ar- ticle 78, 18 pages

  45. [45]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He

  46. [46]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    DeepSpeed: System Optimizations Enable Training Deep Learn- ing Models with Over 100 Billion Parameters. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20). Association for Computing Machinery, 3505–3506. doi:10.1145/3394486.3406703

  47. [47]

    Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang

  48. [48]

    Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025

    MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications. arXiv:2504.09014 [cs.DC]

  49. [49]

    Gregory Pauloski, et al

    Baixi Sun, Weijin Liu, J. Gregory Pauloski, et al. 2025. COMPSO: Opti- mizing Gradient Compression for Distributed Training with Second- Order Optimizers. InProceedings of the 30th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’25). 212–224. doi:10.1145/3710848.3710852

  50. [50]

    Yuki Takezawa, Kenta Niwa, and Makoto Yamada. 2023. Communica- tion Compression for Decentralized Learning With Operator Splitting Methods.IEEE Transactions on Signal and Information Processing over Networks9 (2023), 581–595. doi:10.1109/TSIPN.2023.3307894

  51. [51]

    UCCL Project. 2024. KV Transfer Engine: High-Performance GPU Communication in UCCL.https://uccl-project.github.io/posts/kv- transfer-engine/. Accessed: 2026

  52. [52]

    V Team, Wenyi Hong, et al. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv:2507.01006 [cs.CV]

  53. [53]

    Guanhua Wang, Heyang Qin, Sam Ade Jacobs, et al. 2023. ZeRO++: Ex- tremely Efficient Collective Communication for Giant Model Training. arXiv:2306.10209 [cs.DC]

  54. [54]

    Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, and Lisa Wu Wills. 2025. LLM.265: Video Codecs are Secretly Tensor Codecs. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Asso- ciation for Computing Machinery, New York, NY, USA, 445–460. doi:10.1145/3725843.3756078

  55. [55]

    Abdelmoniem, et al

    Hang Xu, Chen-Yu Ho, Ahmed M. Abdelmoniem, et al. 2021. GRACE: A Compressed Communication Framework for Distributed Machine Learning. In2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (ICDCS ’21). 561–572. doi:10.1109/ICDC S51616.2021.00060

  56. [56]

    Annie Yang, Hari Mukka, Farbod Hesaaraki, and Martin Burtscher

  57. [57]

    In2015 IEEE International Conference on Cluster Computing

    MPC: A Massively Parallel Compression Algorithm for Scientific Data. In2015 IEEE International Conference on Cluster Computing. 381–

  58. [58]

    doi:10.1109/CLUSTER.2015.59

  59. [59]

    Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, et al . 2025. 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11). InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  60. [60]

    Q. Zhou, C. Chu, N. S. Kumar, et al. 2021. Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters. In2021 IEEE International Parallel and Distributed Processing Sympo- sium (IPDPS) (IPDPS ’21). 444–453. doi:10.1109/IPDPS49936.2021.00053

  61. [61]

    Yang Zhou, Zhongjie Chen, Ziming Mao, et al. 2025. An Extensible Soft- ware Transport Layer for GPU Networking. arXiv:2504.17307 [cs.NI] 13