pith. machine review for the scientific record.

arxiv: 2604.17172 · v2 · submitted 2026-04-19 · 💻 cs.DC · cs.AI

Recognition: unknown

UCCL-Zip: Lossless Compression Supercharged GPU Communication

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:31 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords lossless compression · GPU communication · point-to-point · NCCL · collective communication · LLM training · reinforcement learning · distributed inference

The pith

Lossless compression, fused into GPU communication kernels, cuts synchronization time without numerical error or API changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that lossless compression can be embedded directly inside existing GPU point-to-point and collective communication paths. It does so by redesigning the data movement pipelines and kernel execution models so that compression and transmission overlap on large blocks while preserving exact numerical values. A sympathetic reader would care because GPU communication now dominates runtime in large-model training and serving, and earlier compression methods either risked accuracy loss or forced changes to application code. If the approach holds, distributed workloads can send less data per transfer and finish faster while remaining fully compatible with current libraries and programs.

Core claim

UCCL-Zip integrates lossless compression directly into GPU communication primitives. For point-to-point communication it uses a split-send pipeline that exposes transmissible data early and overlaps compression with communication while operating on large data blocks. For collective communication it fuses compression into NCCL's persistent kernel model, eliminating redundant memory traffic and kernel launches. The design supports both patterns without modifying user-facing APIs and without compromising numerical correctness.
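
To make the collective path concrete, here is a minimal sketch of the fused pipeline's data flow, following the description that accompanies Figure 6. The callables are hypothetical placeholders, not the authors' kernel API; in UCCL-Zip these stages run fused inside a single persistent kernel rather than as separate host-side calls.

```python
# Conceptual data flow of ring all_reduce with compression stages inserted
# (per-slice receive-reduce-send, as described alongside Figure 6).
# compress/decompress/send/recv/reduce are assumed callables, not real NCCL API.

def ring_all_reduce_with_compression(slices, compress, decompress,
                                     send, recv, reduce):
    for i, s in enumerate(slices):
        incoming = decompress(recv())   # wire format is compressed
        merged = reduce(s, incoming)    # reduction sees exact decompressed values
        send(compress(merged))          # forward the partial sum in compressed form
        slices[i] = merged
    return slices
```

Because the compression is lossless, the reduction receives bit-identical inputs to the uncompressed pipeline, which is what preserves numerical correctness.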

What carries the argument

The split-send pipeline for P2P and the fused compression step inside NCCL persistent kernels, which together allow compression to run concurrently with data transfer and remove extra memory operations.
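
One way to read the split-send design (a sketch under stated assumptions, not the paper's implementation): the transmissible portion of a tensor goes out immediately while the remainder is compressed as one large block on a second CUDA stream, then follows in compressed form. Here `gpu_compress` is a hypothetical stand-in for a GPU lossless compressor such as the ANS kernels the paper builds on.

```python
import torch
import torch.distributed as dist

def split_send(tensor: torch.Tensor, dst: int, gpu_compress, split: float = 0.5):
    """Sketch of split-send: overlap large-block compression with transmission."""
    flat = tensor.reshape(-1)
    n = int(flat.numel() * split)
    head, tail = flat[:n], flat[n:]

    comp_stream = torch.cuda.Stream()
    with torch.cuda.stream(comp_stream):
        tail_z = gpu_compress(tail)      # compress the remainder as one large block

    req = dist.isend(head, dst)          # head is transmissible early: send it now
    torch.cuda.current_stream().wait_stream(comp_stream)
    req.wait()
    dist.send(tail_z, dst)               # then ship the compressed remainder
```

Operating on one large tail block, rather than many small chunks, matches the paper's observation that GPU compression latency does not shrink proportionally with chunk size (Figure 4).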

If this is right

  • RL weight synchronization accelerates by up to 47.5 percent.
  • vLLM end-to-end inference latency drops by up to 10 percent.
  • No changes are required to application source code or APIs.
  • Both point-to-point and collective patterns remain fully supported.
  • Numerical results stay identical to the uncompressed case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion pattern could be applied to additional collective operations such as all-reduce variants if the kernel integration cost remains low.
  • Shorter synchronization times might enable more frequent model updates in distributed reinforcement learning without raising total wall-clock time.
  • Lower communication duration could reduce the fraction of time GPUs spend idle, improving overall cluster throughput in multi-tenant environments.
  • If the overhead stays favorable on newer interconnects, communication libraries might adopt fused lossless compression as a default option.

Load-bearing premise

The time cost of performing lossless compression and decompression on the GPU stays small enough relative to the bandwidth saved that overall communication finishes faster across real workloads.
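
A back-of-envelope model makes the premise checkable. Only the 64% compression ratio below comes from the paper (Figure 8); the other numbers are assumed for illustration.

```python
# Net gain of compressed vs. plain transfer (illustrative model): compression
# pays off only when the wire-time saved exceeds the compression latency left
# exposed after overlap with communication.

def net_gain_ms(volume_gb, link_gb_per_s, ratio, comp_ms, overlap_frac):
    baseline_ms = volume_gb / link_gb_per_s * 1000.0  # uncompressed transfer time
    wire_ms = baseline_ms * ratio                     # compressed payload on the wire
    exposed_ms = comp_ms * (1.0 - overlap_frac)       # compression cost not hidden
    return baseline_ms - (wire_ms + exposed_ms)       # > 0 means a net win

# 1 GB over an assumed 50 GB/s link, 64% ratio, 3 ms compression, 80% overlapped:
print(net_gain_ms(1.0, 50.0, 0.64, 3.0, 0.8))         # ~6.6 ms saved per transfer
```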

What would settle it

A direct timing measurement on an LLM workload comparing the added compression-plus-decompression latency against the reduction in network transfer time for the observed data volumes and link speeds; the premise fails wherever the added latency exceeds the savings.

Figures

Figures reproduced from arXiv: 2604.17172 by Chon Lam Lao, Delong Meng, Ion Stoica, Jia Zhen, Jun Wu, Shuang Ma, Yang Zhou, Yida Wang, Zhiying Xu, Zhuang Wang, Ziming Mao.

Figure 1. Overview of UCCL-Zip. (a) A naive design suffers from lack of overlap and additional kernel overhead. (b) Uzip-P2P enables early transmission and overlaps compression with communication.

Figure 2. Overview and time breakdown of a typical GPU-based floating-point ANS compression pipeline (S1–S3 denote Steps 1–3). The pipeline separates the exponent field from the remaining bits (sign and fraction) and compresses only the exponent field, improving compression efficiency over directly compressing raw floating-point representations.

Figure 4. Comparison of pipelining designs for overlapping compression and communication. (b)(c) Chunk-based pipelining assumes latency scales with size but is ineffective for GPU compression. (d) Split-send exposes transmissible data early and overlaps communication with the remaining compression, reducing communication stall time.

Figure 5. Localized frequency tables eliminate global coordination and enable a fully fused compression pipeline. (a) A global frequency table requires cross-CTA synchronization, preventing kernel fusion and introducing additional memory passes. (b) With localized tables, each CTA independently samples and constructs its own table, enabling a fused pipeline within a single kernel without synchronization.

Figure 6. Workflow of NCCL all_reduce with compression. In the original NCCL pipeline, each GPU performs a sequence of receive–reduce–send steps for every data slice; the extended pipeline inserts compression and decompression stages while preserving the streaming execution model, so data is compressed at the sender and transmitted in compressed form.

Figure 7. Throughput comparison for bfloat16 peer-to-peer communication across tensor sizes.

Figure 8. Throughput of Uzip-NCCL collective communication primitives (all_to_all and ring all_reduce) across varying sizes, approaching the theoretical upper bound of 73.8 GB/s derived from Amdahl's Law under a 64% compression ratio. For smaller tensors the benefits are more modest (e.g., 8% at 16 MB and 24% at 32 MB), as compression overhead partially offsets bandwidth savings.

Figure 9. Throughput of two-shot all_reduce implemented with asynchronous isend/irecv on two p5en.48xlarge instances (16 GPUs total) with NVLink disabled. Two-shot all_reduce consists of a reduce-scatter phase and an all-gather phase; the all-gather involves only data movement and shows similar compression behavior across implementations.

Figure 10. Application-level evaluation of Uzip-P2P on bf16 weight tensors during RL training: (a) the dense GLM4-9B (9B parameters) and (b) the mixture-of-experts Qwen3.5-35B-A3B (35B parameters). The training pipeline runs on 8 GPUs, with 4 performing policy optimization and 4 generating rollouts.

Figure 11. KV cache transfer latency under prefill–decode disaggregation (P1D3) in vLLM for the Qwen-7B-Chat model.

Figure 12. Communication throughput for different versions of the gate_up_proj weight tensor (214 MB) during GLM4-9B RL training.

Figure 13. Performance of Uzip-P2P across different floating-point data types (bfloat16, float16, float32, float8_e4m3fn, float8_e5m2): throughput and compression ratio across tensor sizes.

Figure 17. Uzip-P2P throughput on AMD MI355X over RoCEv2.
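
Figure 2's pipeline separates each float's exponent field from its sign and fraction bits and entropy-codes only the exponents. A minimal sketch of that split for bfloat16 bit patterns (NumPy on raw uint16 views; the ANS coder itself is omitted) shows why recombination is exact:

```python
import numpy as np

# bfloat16 layout: 1 sign bit, 8 exponent bits, 7 fraction bits.
def split_bf16(bits: np.ndarray):
    exponent = ((bits >> 7) & 0xFF).astype(np.uint8)  # low-entropy plane: compress
    rest = bits & np.uint16(0x807F)                   # sign + fraction: send raw
    return exponent, rest

def merge_bf16(exponent: np.ndarray, rest: np.ndarray) -> np.ndarray:
    return rest | (exponent.astype(np.uint16) << np.uint16(7))

# Stand-in for the uint16 view of a bfloat16 tensor.
raw = np.random.randint(0, 1 << 16, size=8, dtype=np.uint16)
e, r = split_bf16(raw)
assert np.array_equal(merge_bf16(e, r), raw)          # exact round-trip: lossless
```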
Original abstract

The rapid growth of large language models (LLMs) has made GPU communication a critical bottleneck. While prior work reduces communication volume via quantization or lossy compression, these approaches introduce numerical errors that can degrade convergence, accuracy, and stability. We present UCCL-Zip, a unified design that integrates lossless compression directly into GPU communication primitives. UCCL-Zip supports both point-to-point (P2P) and collective communication without modifying user-facing APIs or compromising numerical correctness. For P2P communication, Uzip-P2P employs a split-send pipeline that exposes transmissible data early and overlaps compression with communication, while preserving high GPU efficiency by operating on large data blocks. For collective communication, Uzip-NCCL integrates compression into NCCL's persistent kernel model via fused execution, eliminating redundant memory traffic and kernel launches. In real workloads, UCCL-Zip accelerates RL weight synchronization by up to 47.5% and reduces vLLM end-to-end inference latency by up to 10%, all without application changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents UCCL-Zip, a system that integrates lossless compression directly into GPU communication primitives for both point-to-point (P2P) and collective operations. For P2P, it uses a split-send pipeline to overlap compression with communication; for collectives, it fuses compression into NCCL persistent kernels to eliminate redundant traffic and launches. The work claims up to 47.5% speedup in RL weight synchronization and up to 10% reduction in vLLM end-to-end inference latency, all without application changes or numerical errors.

Significance. If the empirical claims hold under scrutiny, UCCL-Zip provides a practical lossless alternative to quantization for reducing communication volume in distributed LLM training and inference. This could meaningfully alleviate bandwidth bottlenecks in large-scale GPU clusters while preserving correctness, with the fused-kernel approach representing a potentially reusable technique for other communication libraries.

major comments (2)
  1. [Abstract] The central claim that Uzip-NCCL achieves net gains by fusing compression into NCCL persistent kernels (eliminating redundant memory traffic and kernel launches) is load-bearing for the reported speedups, yet the abstract supplies no microbenchmark isolating compression compute latency, temporary buffer overhead, or intra-kernel synchronization costs against bandwidth savings on the exact tensor shapes used in the RL and vLLM experiments.
  2. [Abstract] The reported 47.5% RL synchronization and 10% vLLM latency improvements are presented without any description of baselines, workload tensor dimensions, number of runs, error bars, or breakdown of compression overhead relative to pure communication time; this makes it impossible to determine whether the gains are robust or specific to communication-dominant regimes.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the specific lossless compression algorithm (e.g., LZ4, Zstd, or custom) and its block size to allow readers to assess GPU efficiency claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional context on microbenchmarks and experimental details would improve clarity and will revise the abstract to incorporate key points while preserving conciseness. We address each comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that Uzip-NCCL achieves net gains by fusing compression into NCCL persistent kernels (eliminating redundant memory traffic and kernel launches) is load-bearing for the reported speedups, yet the abstract supplies no microbenchmark isolating compression compute latency, temporary buffer overhead, or intra-kernel synchronization costs against bandwidth savings on the exact tensor shapes used in the RL and vLLM experiments.

    Authors: The full manuscript includes microbenchmarks that isolate compression compute latency, temporary buffer overhead, and intra-kernel synchronization costs against bandwidth savings for the exact tensor shapes and sizes appearing in the RL and vLLM workloads. These measurements confirm that the fused-kernel design yields net gains because bandwidth reduction outweighs the added compute and synchronization costs in the relevant regimes. We will revise the abstract to briefly reference these isolation results and the observed net positive impact. revision: yes

  2. Referee: [Abstract] The reported 47.5% RL synchronization and 10% vLLM latency improvements are presented without any description of baselines, workload tensor dimensions, number of runs, error bars, or breakdown of compression overhead relative to pure communication time; this makes it impossible to determine whether the gains are robust or specific to communication-dominant regimes.

    Authors: The manuscript details the baselines (standard NCCL without compression), workload tensor dimensions, number of runs with error bars, and overhead breakdowns relative to pure communication time in the evaluation sections. The reported speedups are shown to be robust in communication-dominant regimes across the tested configurations. We will update the abstract to include a concise statement of the experimental conditions and direct readers to the evaluation for the full breakdown. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system measurements with no derivation chain

Full rationale

The paper describes a systems design (Uzip-P2P split-send pipeline and Uzip-NCCL fused persistent kernels) and supports its claims exclusively through empirical timing measurements on RL weight sync and vLLM workloads. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. Performance numbers (47.5% and 10%) are reported as direct observations rather than predictions derived from the design by construction. The central assumption about fusion overhead is tested experimentally, not presupposed mathematically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a systems-engineering paper whose central claims rest on standard assumptions about GPU hardware interfaces and the feasibility of kernel fusion rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: Existing GPU communication libraries (NCCL, P2P primitives) can be extended with compression kernels without breaking compatibility or introducing prohibitive overhead.
    Invoked when describing Uzip-P2P and Uzip-NCCL integration.

pith-pipeline@v0.9.0 · 5506 in / 1200 out tokens · 53852 ms · 2026-05-10T06:31:26.765075+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 39 canonical work pages · 7 internal anchors

  1. [1]

    Dan Alistarh, Demjan Grubic, Jerry Li, et al . 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encod- ing. InAdvances in Neural Information Processing Systems (NeurIPS ’17)

  2. [2]

    AMD. 2024. RCCL: AMD ROCm Collective Communication Library. https://github.com/ROCmSoftwarePlatform/rccl. Accessed: 2026

  3. [3]

    Noushin Azami, Alex Fallin, and Martin Burtscher. 2025. Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs. InProceedings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS ’25). doi:10.1145/3669940.3707280

  4. [4]

    Jinze Bai, Shuai Bai, Yunfei Chu, et al. 2023. Qwen Technical Report. arXiv:2309.16609 [cs.CL]

  5. [5]

    Jiamin Cao, Yu Guan, Kun Qian, et al. 2024. Crux: GPU-Efficient Com- munication Scheduling for Deep Learning Training. InProceedings of the ACM SIGCOMM 2024 Conference (ACM SIGCOMM ’24). Association for Computing Machinery, 1–15. doi:10.1145/3651890.3672239

  6. [6]

    Li-Wen Chang, Wenlei Bao, Qi Hou, et al. 2024. FLUX: Fast Software- based Communication Overlap On GPUs Through Kernel Fusion. arXiv:2406.06858 [cs.LG]

  7. [7]

    Chuyan Chen, Yutong He, Pengrui Li, et al. 2025. Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees. arXiv:2507.08784 [cs.LG]

  8. [8]

    Chen-Chun Chen, Yu-Min Chou, and Jerry Chou. 2023. PHY: A Performance-Driven Hybrid Communication Compression Method for Distributed Training.J. Parallel and Distrib. Comput.180 (2023), 104719. doi:10.1016/j.jpdc.2023.104719

  9. [9]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...

  10. [10]

    PaLM: Scaling Language Modeling with Pathways.J. Mach. Learn. Res.24, 1, Article 240 (2023), 113 pages

  11. [11]

    Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Zhou Yangjie, Shixuan Sun, and Minyi Guo

  12. [12]

    arXiv:2504.14489v1 [cs.OS]

    Optimizing SLO-oriented LLM Serving with PD-Multiplexing. arXiv:2504.14489v1 [cs.OS]

  13. [13]

    DeepSeek-AI, Aixin Liu, Bei Feng, et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]

  14. [14]

    Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, and Xiaowen Chu. 2026. ZipServ: Fast and Memory- Efficient LLM Inference with Hardware-Aware Lossless Compression. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’26). Association...

  15. [15]

    Sahu, et al

    Jiawei Fei, Chen-Yu Ho, Atal N. Sahu, et al. 2021. Efficient Sparse Col- lective Communication and Its Application to Accelerate Distributed Deep Learning. InProceedings of the 2021 ACM SIGCOMM Conference (SIGCOMM ’21). 676–691. doi:10.1145/3452296.3472904

  16. [16]

    Tianxiang Gao, Xiaokai Huo, Hailiang Liu, and Hongyang Gao. 2023. Wide neural networks as Gaussian processes: lessons from deep equi- librium models. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2397, 34 pages

  17. [17]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]

  18. [18]

    Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantiza- tion and Huffman Coding. arXiv:1510.00149 [cs.CV]

  19. [19]

    Yongchang Hao, Yanshuai Cao, and Lili Mou. 2024. NeuZip: Memory- Efficient Training and Inference with Dynamic Compression of Neural Networks. arXiv:2410.20650 [cs.LG]

  20. [20]

    Horace He and Thinking Machines Lab. 2025. Defeating Nondetermin- ism in LLM Inference.https://thinkingmachines.ai/blog/defeating- nondeterminism-in-llm-inference/. Thinking Machines Lab: Connec- tionism

  21. [21]

    Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmon- sky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, Swami- nathan Sundararaman, and Danny Harnik. 2025. ZipNN: Lossless Compression for AI Models. In2025 IEEE 18th International Conference on Cloud Computing (CLOUD). 186–198. doi:10.1109/CLOUD67622.2 025.00028

  22. [22]

    Zhiyi Hu, Siyuan Shen, Tommaso Bonato, et al. 2025. Demystifying NCCL: An In-Depth Analysis of GPU Communication Protocols and Algorithms. In2025 IEEE Symposium on High-Performance Intercon- nects (HOTI). 48–59. doi:10.1109/HOTI66940.2025.00024

  23. [23]

    Jiajun Huang, Sheng Di, Yafan Huang, et al . 2025. GhZCCL: Ad- vancing GPU-aware Collective Communications with Homomorphic Compression. InProceedings of the 39th ACM International Conference on Supercomputing (ICS ’25). 43–56. doi:10.1145/3721145.3733642

  24. [24]

    Jiajun Huang, Sheng Di, Xiaodong Yu, et al . 2024. gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters. InProceedings of the 38th ACM International Conference on Supercomputing (ICS ’24). 437–448. doi:10.1145/3650200.3656636

  25. [25]

    Jiajun Huang, Sheng Di, Xiaodong Yu, et al . 2024. An Optimized Error-Controlled MPI Collective Framework Integrated with Lossy Compression. In2024 IEEE International Parallel and Distributed Pro- cessing Symposium (IPDPS). 752–764

  26. [26]

    Jiajun Huang, Sheng Di, Xiaodong Yu, et al. 2025. ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression. arXiv:2502.18554 [cs.DC]

  27. [27]

    Hoskins, Matthew W

    Siyuan Huang, Brian D. Hoskins, Matthew W. Daniels, et al . 2023. Low-Rank Gradient Descent for Memory-Efficient Training of Deep In-Memory Arrays.Journal of Emerging Technologies in Computing Systems19, 2, Article 16 (2023), 24 pages. doi:10.1145/3577214

  28. [28]

    Samirasadat Jamalidinan and Kazem Cheshmi. 2025. Floating- Point Data Transformation for Lossless Compression. 12 arXiv:2506.18062 [cs.DB]

  29. [29]

    Sylvain Jeaugey. 2017. NCCL: Optimized Primitives for Collective Multi-GPU Communication.https://developer.nvidia.com/nccl

  30. [30]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al . 2023. Efficient Memory Management for Large Language Model Serving with Page- dAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. doi:10.1145/3600006.3613165

  31. [31]

    Alexander Langer, Samuel Howell, Sreeram Potluri, et al. 2021. Dy- namic Symmetric Heap Allocation in NVSHMEM. InOpenSHMEM and Related Technologies (Lecture Notes in Computer Science). Springer, 187–198. doi:10.1007/978-3-031-04888-3_12

  32. [32]

    Minghao Li, Ran Ben Basat, Shay Vargaftik, et al . 2024. THC: Ac- celerating Distributed Deep Learning Using Tensor Homomorphic Compression. arXiv:2302.08545 [cs.LG]

  33. [33]

    Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. 2023. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP ’23). Association for Computing Machinery, 766–775. doi:10.114 5/3605573.3605613

  34. [34]

    Xue Li, Cheng Guo, Kun Qian, et al . 2024. Near-Lossless Gradient Compression for Data-Parallel Distributed DNN Training. InProceed- ings of the ACM Symposium on Cloud Computing (SoCC ’24). 977–994. doi:10.1145/3698038.3698541

  35. [35]

    Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen

  36. [36]

    fabric-lib: RDMA Point-to-Point Communication for LLM Systems

    RDMA Point-to-Point Communication for LLM Systems. arXiv:2510.27656 [cs.DC]

  37. [37]

    Meta AI Research. 2026. DietGPU.https://github.com/facebookresea rch/dietgpu. GitHub repository, accessed 2026-03-07

  38. [38]

    Mooncake Project. 2024. Mooncake Transfer Engine.https://github.c om/kvcache-ai/Mooncake. Accessed: 2026

  39. [39]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. InProceedings of the Interna- tional Conference for Hi...

  40. [40]

    NVIDIA. 2023. nvCOMP: NVIDIA GPU Data Compression Library. https://github.com/NVIDIA/nvcomp. Accessed: July 31, 2023

  41. [41]

    NVIDIA. 2023. NVIDIA CUDA C Programming Guide.https://docs.n vidia.com/cuda/cuda-c-programming-guide/

  42. [42]

    NVIDIA. 2025. NIXL: NVIDIA Inference Xfer Library.https://github .com/ai-dynamo/nixl

  43. [43]

    Qwen, An Yang, Baosong Yang, et al. 2025. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL]

  44. [44]

    Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2024. CASSINI: Network-Aware Job Scheduling in Machine Learning Clus- ters. InProceedings of the 21st USENIX Symposium on Networked Sys- tems Design and Implementation (NSDI ’24). USENIX Association, Ar- ticle 78, 18 pages

  45. [45]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He

  46. [46]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    DeepSpeed: System Optimizations Enable Training Deep Learn- ing Models with Over 100 Billion Parameters. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20). Association for Computing Machinery, 3505–3506. doi:10.1145/3394486.3406703

  47. [47]

    Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang

  48. [48]

    Msccl++: Rethinking gpu communication abstractions for cutting-edge ai applications, 2025

    MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications. arXiv:2504.09014 [cs.DC]

  49. [49]

    Gregory Pauloski, et al

    Baixi Sun, Weijin Liu, J. Gregory Pauloski, et al. 2025. COMPSO: Opti- mizing Gradient Compression for Distributed Training with Second- Order Optimizers. InProceedings of the 30th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’25). 212–224. doi:10.1145/3710848.3710852

  50. [50]

    Yuki Takezawa, Kenta Niwa, and Makoto Yamada. 2023. Communica- tion Compression for Decentralized Learning With Operator Splitting Methods.IEEE Transactions on Signal and Information Processing over Networks9 (2023), 581–595. doi:10.1109/TSIPN.2023.3307894

  51. [51]

    UCCL Project. 2024. KV Transfer Engine: High-Performance GPU Communication in UCCL.https://uccl-project.github.io/posts/kv- transfer-engine/. Accessed: 2026

  52. [52]

    V Team, Wenyi Hong, et al. 2025. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. arXiv:2507.01006 [cs.CV]

  53. [53]

    Guanhua Wang, Heyang Qin, Sam Ade Jacobs, et al. 2023. ZeRO++: Ex- tremely Efficient Collective Communication for Giant Model Training. arXiv:2306.10209 [cs.DC]

  54. [54]

    Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, and Lisa Wu Wills. 2025. LLM.265: Video Codecs are Secretly Tensor Codecs. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Asso- ciation for Computing Machinery, New York, NY, USA, 445–460. doi:10.1145/3725843.3756078

  55. [55]

    Abdelmoniem, et al

    Hang Xu, Chen-Yu Ho, Ahmed M. Abdelmoniem, et al. 2021. GRACE: A Compressed Communication Framework for Distributed Machine Learning. In2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (ICDCS ’21). 561–572. doi:10.1109/ICDC S51616.2021.00060

  56. [56]

    Annie Yang, Hari Mukka, Farbod Hesaaraki, and Martin Burtscher

  57. [57]

    In2015 IEEE International Conference on Cluster Computing

    MPC: A Massively Parallel Compression Algorithm for Scientific Data. In2015 IEEE International Conference on Cluster Computing. 381–

  58. [58]

    doi:10.1109/CLUSTER.2015.59

  59. [59]

    Tianyi Zhang, Mohsen Hariri, Shaochen Zhong, et al . 2025. 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11). InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  60. [60]

    Q. Zhou, C. Chu, N. S. Kumar, et al. 2021. Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters. In2021 IEEE International Parallel and Distributed Processing Sympo- sium (IPDPS) (IPDPS ’21). 444–453. doi:10.1109/IPDPS49936.2021.00053

  61. [61]

    Yang Zhou, Zhongjie Chen, Ziming Mao, et al. 2025. An Extensible Soft- ware Transport Layer for GPU Networking. arXiv:2504.17307 [cs.NI] 13