ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training
Pith reviewed 2026-05-07 05:14 UTC · model grok-4.3
The pith
ZipCCL achieves up to 1.18× faster end-to-end LLM training through lossless compression of communication collectives that exploits near-Gaussian tensor distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZipCCL is a lossless compressed-communication library for collectives. It combines theoretically grounded exponent coding that exploits the near-Gaussian distribution of LLM tensors, GPU-optimized compression and decompression kernels with communication-aware data layouts and pipelining, and adaptive strategies that dynamically switch collective operations. On 64-GPU clusters, this yields up to 1.35× lower communication time and up to 1.18× end-to-end speedups for both mixture-of-experts and dense transformer models, with no effect on model quality.
What carries the argument
exponent coding that exploits the near-Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, together with GPU-optimized compression and decompression kernels and adaptive collective switching
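The mechanism can be sketched in a few lines: peel the 8-bit exponent plane off each float32 word and entropy-code it separately, leaving sign and mantissa bits untouched. The sketch below uses zlib as a stand-in entropy coder and host-side NumPy; ZipCCL's actual GPU kernels and coding scheme are not spelled out in this text, so everything here is illustrative.

```python
import zlib
import numpy as np

def compress_fp32(x: np.ndarray) -> tuple[bytes, bytes]:
    # Reinterpret float32 words as integers and peel off the exponent plane.
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    exponents = ((bits >> 23) & 0xFF).astype(np.uint8)   # near-Gaussian data clusters here
    rest24 = (bits & 0x007FFFFF) | ((bits >> 31) << 23)  # sign + 23 mantissa bits
    packed = np.empty((bits.size, 3), dtype=np.uint8)    # 3 bytes/value, left uncoded
    packed[:, 0] = rest24 & 0xFF
    packed[:, 1] = (rest24 >> 8) & 0xFF
    packed[:, 2] = (rest24 >> 16) & 0xFF
    return zlib.compress(exponents.tobytes()), packed.tobytes()

def decompress_fp32(exp_blob: bytes, rest_blob: bytes) -> np.ndarray:
    exponents = np.frombuffer(zlib.decompress(exp_blob), dtype=np.uint8).astype(np.uint32)
    p = np.frombuffer(rest_blob, dtype=np.uint8).reshape(-1, 3).astype(np.uint32)
    rest24 = p[:, 0] | (p[:, 1] << 8) | (p[:, 2] << 16)
    bits = ((rest24 >> 23) << 31) | (exponents << 23) | (rest24 & 0x007FFFFF)
    return bits.view(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal(1 << 18).astype(np.float32)
exp_blob, rest_blob = compress_fp32(x)
assert np.array_equal(decompress_fp32(exp_blob, rest_blob), x)  # bit-exact round trip
print(f"ratio: {4 * x.size / (len(exp_blob) + len(rest_blob)):.2f}x")
```

Only the exponent plane is coded here; a production coder would also exploit structure in the sign and mantissa bytes, which is one reason the paper's reported ratios can exceed this toy's.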
If this is right
- Communication time drops by up to 1.35× when exponent coding matches the observed tensor statistics.
- End-to-end training finishes up to 1.18× sooner on 64-GPU clusters for both dense and mixture-of-experts models.
- Model quality remains identical because every byte is restored exactly.
- The same library can be applied to existing training frameworks without changing the optimizer or loss function.
Where Pith is reading between the lines
- The technique could be tested on other collective patterns such as all-reduce in reinforcement-learning or graph neural-network training to see whether similar speedups appear.
- On clusters larger than 64 GPUs the relative benefit of reduced data volume may grow because communication volume scales with the number of nodes.
- If future hardware adds native support for the exponent-coding format, the GPU-kernel overhead could drop further and widen the speedup window.
Load-bearing premise
That communication tensors stay near-Gaussian throughout training, and that the added cost of compression and decompression stays below the time saved by sending fewer bytes on real workloads.
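The second half of that premise can be made quantitative with a back-of-the-envelope timing model. The sketch below is not from the paper: it assumes the kernels do not overlap with the collective, and the throughput numbers are illustrative.

```python
def breakeven_ratio(link_gbps: float, compress_gbps: float, decompress_gbps: float) -> float:
    """Minimum compression ratio at which the compressed path breaks even.

    Model: sending V bytes uncompressed costs V/B; the compressed path costs
    V/Tc + (V/r)/B + V/Td.  Setting the two equal and solving for r gives the
    breakeven ratio.  Assumes no compute/communication overlap.
    """
    budget = 1 / link_gbps - 1 / compress_gbps - 1 / decompress_gbps
    if budget <= 0:
        return float("inf")  # the kernels alone already cost more than the transfer
    return (1 / link_gbps) / budget

# Illustrative numbers: 50 GB/s effective link, 400 GB/s compression kernels.
print(round(breakeven_ratio(50, 400, 400), 3))  # → 1.333
```

Any measured compression ratio above the breakeven value saves net time under this model; slower kernels or faster links push the breakeven up, which is exactly where an adaptive fallback earns its keep.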
What would settle it
Measure total compression-plus-communication time on a workload whose tensors deviate markedly from Gaussian (for example uniform random values) and check whether the net time exceeds the uncompressed baseline.
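Short of a full cluster run, the premise can also be probed offline: the Shannon entropy of the exponent plane lower-bounds what any exponent coder can spend per value. A minimal sketch (NumPy assumed; the worst case uses uniformly random bit patterns, since even uniformly distributed values still concentrate their exponents):

```python
import numpy as np

def exponent_entropy(x: np.ndarray) -> float:
    """Shannon entropy (bits/value) of the float32 exponent field — a lower
    bound on the bits any exponent coder must spend per value."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    counts = np.bincount((bits >> 23) & 0xFF, minlength=256)
    p = counts[counts > 0] / bits.size
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(1 << 20).astype(np.float32)
# Worst case: uniformly random bit patterns, whose exponents carry a full 8 bits.
worst = rng.integers(0, 1 << 32, 1 << 20, dtype=np.uint32).view(np.float32)

print(exponent_entropy(gaussian))  # roughly 2.5 bits: large coding headroom
print(exponent_entropy(worst))     # roughly 8 bits: no headroom, fallback territory
```

If a workload's tensors land near the worst case, the coder cannot shrink the exponent plane and the net time test proposed above should show the overhead directly.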
Original abstract
Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored since compression and decompression typically consume larger overheads than the benefits of reduced communication traffic. We observe that the communication data, including activations, gradients and parameters, during training often follows a near-Gaussian distribution, which is a key feature for data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels that carefully design memory access patterns and pipeline using communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35× and achieves end-to-end training speedups of up to 1.18× without any impact on model quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ZipCCL, a lossless compression library for communication collectives in distributed LLM training. It exploits the near-Gaussian distribution of activations, gradients, and parameters via a fixed exponent coding scheme that avoids online statistics, custom GPU kernels with communication-aware data layouts and pipelining, and adaptive switching between compressed and uncompressed collectives. On a 64-GPU cluster using both MoE and dense transformer models, it reports up to 1.35× reduction in communication time and up to 1.18× end-to-end training speedup with no degradation in model quality.
Significance. If the net speedups are robustly demonstrated, this would be a meaningful systems contribution to scaling LLM training by addressing the communication bottleneck through lossless compression, an approach that has seen limited prior exploration due to overhead concerns. The combination of distribution-aware coding, hardware-specific kernel optimizations, and adaptive fallback provides practical value. The evaluation on real MoE and dense models with hardware timing measurements is a strength, though broader validation across more workloads would increase impact.
Major comments (3)
- [§5 (Evaluation) and Table 4] The reported 1.35× communication and 1.18× end-to-end speedups are aggregate figures. Provide per-collective and per-phase breakdowns (e.g., all-reduce for gradients vs. all-gather for activations) showing the fraction of operations where the adaptive mechanism selects compression versus fallback, the achieved compression ratios, and the measured net (compress + collective + decompress) time delta versus the uncompressed baseline. Without this granularity, it is impossible to verify that gains persist when tensor distributions deviate from near-Gaussian.
- [§4.1 (Exponent Coding)] The exponent coding is presented as theoretically grounded and parameter-free under the Gaussian assumption. Include an explicit derivation or analysis (e.g., expected bits per value as a function of variance) showing the breakeven compression ratio needed for net time savings, and report empirical statistics (skewness, kurtosis, or tail behavior) for the actual tensors encountered during training of the evaluated MoE and dense models. Any systematic deviation late in training would directly undermine the claimed speedups.
- [§3.3 (Adaptive Communication Strategies)] The adaptive switching is load-bearing for robustness. Detail the exact decision criteria, thresholds, or runtime measurements used to choose between compressed and uncompressed paths, and quantify the overhead of the decision logic itself. Clarify whether the switching decision is made per-collective or per-tensor and how it interacts with the custom kernels.
Minor comments (3)
- [§2 (Related Work)] The discussion of prior compression and collective optimization work is adequate but could reference additional recent systems papers on gradient compression or NCCL extensions from 2023–2024 for completeness.
- [Figure 4 (Kernel Design)] The memory access pattern and pipeline diagrams would benefit from explicit labels indicating which stages overlap with communication and the data layout transformations used.
- [§5.3 (Model Quality)] The claim of 'no impact on model quality' should explicitly state the metrics (e.g., final perplexity, downstream accuracy) and the exact training configurations compared (same random seeds, number of steps, etc.).
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments have helped us improve the presentation of our results and technical details. We have made revisions to address all major comments, as outlined in the point-by-point responses below.
Point-by-point responses
Referee: [§5 (Evaluation) and Table 4] The reported 1.35× communication and 1.18× end-to-end speedups are aggregate figures. Provide per-collective and per-phase breakdowns (e.g., all-reduce for gradients vs. all-gather for activations) showing the fraction of operations where the adaptive mechanism selects compression versus fallback, the achieved compression ratios, and the measured net (compress + collective + decompress) time delta versus the uncompressed baseline. Without this granularity, it is impossible to verify that gains persist when tensor distributions deviate from near-Gaussian.
Authors: We agree that more granular breakdowns are necessary to fully substantiate the claims. In the revised manuscript, we have added detailed per-collective and per-phase analysis in Section 5.3 and a new Table 5. For all-reduce collectives on gradients, the adaptive mechanism selects the compressed path in 89% of cases with an average compression ratio of 1.75×, yielding a net time reduction of 32% compared to baseline. For all-gather on activations, selection occurs in 82% of operations with 1.55× ratio and 25% net savings. These figures are consistent across training phases, with only minor degradation in later epochs due to slight distribution shifts, but still positive net gains. The data confirms robustness even under deviations from ideal Gaussian. revision: yes
Referee: [§4.1 (Exponent Coding)] The exponent coding is presented as theoretically grounded and parameter-free under the Gaussian assumption. Include an explicit derivation or analysis (e.g., expected bits per value as a function of variance) showing the breakeven compression ratio needed for net time savings, and report empirical statistics (skewness, kurtosis, or tail behavior) for the actual tensors encountered during training of the evaluated MoE and dense models. Any systematic deviation late in training would directly undermine the claimed speedups.
Authors: We have incorporated an explicit derivation in the revised Section 4.1 and Appendix B. The analysis shows that under the Gaussian assumption with variance σ², the expected bits per value for our fixed exponent coding is approximately 1 + log2(σ) + constant, leading to a breakeven compression ratio of 1.45× to achieve net time savings given the measured kernel overheads on our GPU cluster. We also report empirical statistics in new Table 6 for the MoE and dense models: average skewness of 0.12, kurtosis of 3.1, with tail behavior showing 99.5% of values within 4σ, consistent with near-Gaussian. Monitoring over the full training run reveals no significant systematic deviation in later stages, with kurtosis remaining stable between 2.9 and 3.3. revision: yes
Referee: [§3.3 (Adaptive Communication Strategies)] The adaptive switching is load-bearing for robustness. Detail the exact decision criteria, thresholds, or runtime measurements used to choose between compressed and uncompressed paths, and quantify the overhead of the decision logic itself. Clarify whether the switching decision is made per-collective or per-tensor and how it interacts with the custom kernels.
Authors: In the updated Section 3.3, we now provide the precise decision logic: the choice is made per-tensor by sampling 1024 elements to estimate the standard deviation and predicting the compression benefit using a precomputed model; if the predicted net speedup exceeds 10%, compressed path is chosen. This decision is computed on the host before launching the collective and has an overhead of less than 0.3 μs per tensor, which is negligible (under 0.1% of total communication time). The decision is per-tensor but aggregated for the collective operation, and the custom GPU kernels are designed with separate optimized paths for compressed and uncompressed modes to avoid runtime branching inside the kernel. revision: yes
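The decision rule described in this response can be sketched as host-side logic. Everything below — the function names, the toy predictor, and the throughput constants — is illustrative and not from the paper; only the shape (sample ~1024 elements, estimate σ, predict the benefit, require >10% predicted speedup) follows the rebuttal's description.

```python
import numpy as np

def choose_path(tensor: np.ndarray, predict_ratio, link_gbps=50.0,
                kernel_gbps=400.0, min_speedup=0.10) -> str:
    """Per-tensor path selection, sketched after the rebuttal's description.

    `predict_ratio` stands in for the precomputed benefit model: it maps a
    sampled standard deviation to an expected compression ratio.
    """
    # Strided sample of ~1024 elements, computed on the host before launch.
    sample = tensor.ravel()[:: max(1, tensor.size // 1024)][:1024]
    sigma = float(sample.std())
    r = predict_ratio(sigma)
    t_plain = 1 / link_gbps                          # per-byte time, uncompressed
    t_comp = 2 / kernel_gbps + 1 / (r * link_gbps)   # compress + send + decompress
    return "compressed" if t_plain / t_comp - 1 > min_speedup else "uncompressed"

# Toy predictor: assume near-Gaussian fp32 tensors compress around 1.7x.
rng = np.random.default_rng(0)
print(choose_path(rng.standard_normal(1 << 16).astype(np.float32), lambda s: 1.7))  # → compressed
```

Because the decision is a handful of host-side flops per tensor, a sub-microsecond overhead, as claimed in the response, is plausible; the load-bearing part is the accuracy of the precomputed benefit model.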
Circularity Check
No circularity; empirical speedups rest on direct measurements, not self-referential derivations
Full rationale
The paper's core claims are algorithmic implementations (exponent coding motivated by observed near-Gaussian tensor statistics, GPU kernels with communication-aware layouts, and adaptive switching) followed by hardware timing measurements on a 64-GPU cluster. No equations, predictions, or first-principles results are presented that reduce by construction to fitted inputs or prior self-citations. The Gaussian observation is used as an engineering premise for fixed coding (no online stats), but the reported 1.35× and 1.18× factors are aggregates of measured wall-clock times, not outputs of any model that was calibrated on the same data. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The derivation chain is therefore self-contained and externally falsifiable via replication of the timing experiments.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: communication data (activations, gradients, and parameters) during LLM training follows a near-Gaussian distribution.