UCCL-Zip: Lossless Compression Supercharged GPU Communication
Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3
The pith
Fusing lossless compression into GPU communication kernels cuts synchronization time without numerical errors or API changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UCCL-Zip integrates lossless compression directly into GPU communication primitives. For point-to-point communication it uses a split-send pipeline that exposes transmissible data early and overlaps compression with communication while operating on large data blocks. For collective communication it fuses compression into NCCL's persistent kernel model, eliminating redundant memory traffic and kernel launches. The design supports both patterns without modifying user-facing APIs and without compromising numerical correctness.
What carries the argument
The split-send pipeline for P2P and the fused compression step inside NCCL persistent kernels, which together allow compression to run concurrently with data transfer and remove extra memory operations.
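A minimal host-side sketch of the split-send idea, to make the overlap concrete. It is not the UCCL-Zip implementation: the paper compresses on the GPU inside communication kernels, whereas this sketch uses CPU zlib and a thread pool, and send_block is a hypothetical stand-in for the real transport (an RDMA write or NCCL P2P send). What it illustrates is that once the buffer is split into blocks, block i can be compressed while block i-1 is still in flight, so codec time hides behind transfer time.

```python
# Conceptual sketch of a split-send pipeline (assumptions noted above).
# Compression of the current block overlaps with the in-flight send of the
# previous block.
import zlib
from concurrent.futures import ThreadPoolExecutor


def send_block(block: bytes) -> None:
    """Hypothetical transport call; a real system would post an RDMA write or NCCL send."""
    pass


def split_send(buffer: bytes, block_size: int = 1 << 22) -> None:
    blocks = [buffer[i:i + block_size] for i in range(0, len(buffer), block_size)]
    with ThreadPoolExecutor(max_workers=1) as sender:
        in_flight = None
        for block in blocks:
            compressed = zlib.compress(block, level=1)   # codec work for block i
            if in_flight is not None:
                in_flight.result()                       # block i-1 finishes sending meanwhile
            in_flight = sender.submit(send_block, compressed)
        if in_flight is not None:
            in_flight.result()


if __name__ == "__main__":
    split_send(bytes(32 * 1024 * 1024))  # 32 MB toy payload
```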
If this is right
- RL weight synchronization accelerates by up to 47.5 percent.
- vLLM end-to-end inference latency drops by up to 10 percent.
- No changes are required to application source code or APIs.
- Both point-to-point and collective patterns remain fully supported.
- Numerical results stay identical to the uncompressed case.
Where Pith is reading between the lines
- The same fusion pattern could be applied to additional collective operations such as all-reduce variants if the kernel integration cost remains low.
- Shorter synchronization times might enable more frequent model updates in distributed reinforcement learning without raising total wall-clock time.
- Lower communication duration could reduce the fraction of time GPUs spend idle, improving overall cluster throughput in multi-tenant environments.
- If the overhead stays favorable on newer interconnects, communication libraries might adopt fused lossless compression as a default option.
Load-bearing premise
The time cost of performing lossless compression and decompression on the GPU stays small enough relative to the bandwidth saved that overall communication finishes faster across real workloads.
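A back-of-envelope form of this premise, with illustrative numbers that are assumptions rather than figures from the paper: compressed transfer time plus codec time must stay below the uncompressed transfer time. The fully serialized sum below is pessimistic, since the split-send pipeline and fused kernels hide much of the codec cost behind the transfer.

```python
# Back-of-envelope check of the premise under assumed numbers (not from the paper).
size_bytes = 2e9              # 2 GB of weights to synchronize (assumed)
link_gbps = 100               # effective link bandwidth in Gbit/s (assumed)
ratio = 0.7                   # compressed size / original size (assumed)
codec_gbytes_per_s = 200      # GPU compress or decompress throughput, GB/s each way (assumed)

def transfer_s(nbytes: float) -> float:
    """Time to move nbytes over the link, ignoring latency."""
    return nbytes * 8 / (link_gbps * 1e9)

baseline = transfer_s(size_bytes)
codec = 2 * size_bytes / (codec_gbytes_per_s * 1e9)   # compress + decompress, fully serialized
with_compression = transfer_s(size_bytes * ratio) + codec

print(f"baseline {baseline:.3f} s, with compression {with_compression:.3f} s")
# The premise holds iff with_compression < baseline; overlapping codec work with
# the transfer (as the split-send pipeline does) only widens the margin.
```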
What would settle it
A direct timing measurement on an LLM workload showing that the added compression-plus-decompression latency exceeds the reduction in network transfer time, at the observed data volumes and link speeds, would refute the premise.
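A minimal sketch of that measurement, using CPU zlib as a stand-in codec and an assumed link speed; the real test would use the paper's GPU codec, the actual interconnect, and the tensors from the RL and vLLM workloads. Random bytes are a worst case for lossless compression, whereas real bf16/fp32 weight tensors carry exponent redundancy that such codecs exploit.

```python
# Measurement sketch: does codec latency exceed the transfer time it saves?
# zlib and the 100 Gbps link speed are stand-in assumptions, not the paper's setup.
import os
import time
import zlib

payload = os.urandom(64 * 1024 * 1024)        # 64 MB block; random data is a worst case

t0 = time.perf_counter()
compressed = zlib.compress(payload, level=1)
t1 = time.perf_counter()
zlib.decompress(compressed)
t2 = time.perf_counter()

link_bytes_per_s = 100e9 / 8                  # assumed 100 Gbps link
saved_s = (len(payload) - len(compressed)) / link_bytes_per_s
codec_s = t2 - t0

print(f"codec {codec_s * 1e3:.1f} ms vs transfer time saved {saved_s * 1e3:.1f} ms")
# codec_s > saved_s on the workload's actual tensors would refute the load-bearing premise.
```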
Original abstract
The rapid growth of large language models (LLMs) has made GPU communication a critical bottleneck. While prior work reduces communication volume via quantization or lossy compression, these approaches introduce numerical errors that can degrade convergence, accuracy, and stability. We present UCCL-Zip, a unified design that integrates lossless compression directly into GPU communication primitives. UCCL-Zip supports both point-to-point (P2P) and collective communication without modifying user-facing APIs or compromising numerical correctness. For P2P communication, Uzip-P2P employs a split-send pipeline that exposes transmissible data early and overlaps compression with communication, while preserving high GPU efficiency by operating on large data blocks. For collective communication, Uzip-NCCL integrates compression into NCCL's persistent kernel model via fused execution, eliminating redundant memory traffic and kernel launches. In real workloads, UCCL-Zip accelerates RL weight synchronization by up to 47.5% and reduces vLLM end-to-end inference latency by up to 10%, all without application changes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents UCCL-Zip, a system that integrates lossless compression directly into GPU communication primitives for both point-to-point (P2P) and collective operations. For P2P, it uses a split-send pipeline to overlap compression with communication; for collectives, it fuses compression into NCCL persistent kernels to eliminate redundant traffic and launches. The work claims up to 47.5% speedup in RL weight synchronization and up to 10% reduction in vLLM end-to-end inference latency, all without application changes or numerical errors.
Significance. If the empirical claims hold under scrutiny, UCCL-Zip provides a practical lossless alternative to quantization for reducing communication volume in distributed LLM training and inference. This could meaningfully alleviate bandwidth bottlenecks in large-scale GPU clusters while preserving correctness, with the fused-kernel approach representing a potentially reusable technique for other communication libraries.
major comments (2)
- [Abstract] The central claim that Uzip-NCCL achieves net gains by fusing compression into NCCL persistent kernels (eliminating redundant memory traffic and kernel launches) is load-bearing for the reported speedups, yet the abstract supplies no microbenchmark isolating compression compute latency, temporary buffer overhead, or intra-kernel synchronization costs against the bandwidth savings on the exact tensor shapes used in the RL and vLLM experiments.
- [Abstract] The reported 47.5% RL synchronization and 10% vLLM latency improvements are presented without any description of baselines, workload tensor dimensions, number of runs, error bars, or a breakdown of compression overhead relative to pure communication time; this makes it impossible to determine whether the gains are robust or specific to communication-dominant regimes.
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement of the specific lossless compression algorithm (e.g., LZ4, Zstd, or custom) and its block size to allow readers to assess GPU efficiency claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that additional context on microbenchmarks and experimental details would improve clarity and will revise the abstract to incorporate key points while preserving conciseness. We address each comment below.
Point-by-point responses
Referee: [Abstract] The central claim that Uzip-NCCL achieves net gains by fusing compression into NCCL persistent kernels (eliminating redundant memory traffic and kernel launches) is load-bearing for the reported speedups, yet the abstract supplies no microbenchmark isolating compression compute latency, temporary buffer overhead, or intra-kernel synchronization costs against the bandwidth savings on the exact tensor shapes used in the RL and vLLM experiments.
Authors: The full manuscript includes microbenchmarks that isolate compression compute latency, temporary buffer overhead, and intra-kernel synchronization costs against bandwidth savings for the exact tensor shapes and sizes appearing in the RL and vLLM workloads. These measurements confirm that the fused-kernel design yields net gains because bandwidth reduction outweighs the added compute and synchronization costs in the relevant regimes. We will revise the abstract to briefly reference these isolation results and the observed net positive impact. revision: yes
Referee: [Abstract] The reported 47.5% RL synchronization and 10% vLLM latency improvements are presented without any description of baselines, workload tensor dimensions, number of runs, error bars, or a breakdown of compression overhead relative to pure communication time; this makes it impossible to determine whether the gains are robust or specific to communication-dominant regimes.
Authors: The manuscript details the baselines (standard NCCL without compression), workload tensor dimensions, number of runs with error bars, and overhead breakdowns relative to pure communication time in the evaluation sections. The reported speedups are shown to be robust in communication-dominant regimes across the tested configurations. We will update the abstract to include a concise statement of the experimental conditions and direct readers to the evaluation for the full breakdown. revision: yes
Circularity Check
No circularity: empirical system measurements with no derivation chain
full rationale
The paper describes a systems design (Uzip-P2P split-send pipeline and Uzip-NCCL fused persistent kernels) and supports its claims exclusively through empirical timing measurements on RL weight sync and vLLM workloads. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. Performance numbers (47.5% and 10%) are reported as direct observations rather than predictions derived from the design by construction. The central assumption about fusion overhead is tested experimentally, not presupposed mathematically.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing GPU communication libraries (NCCL, P2P primitives) can be extended with compression kernels without breaking compatibility or introducing prohibitive overhead.