pith. machine review for the scientific record.

arxiv: 2604.27844 · v1 · submitted 2026-04-30 · 💻 cs.DC · cs.CL

Recognition: unknown

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 05:14 UTC · model grok-4.3

classification 💻 cs.DC cs.CL
keywords: lossless compression · LLM training · distributed training · communication collectives · GPU kernels · Gaussian distribution · exponent coding · adaptive communication

The pith

ZipCCL achieves up to 1.18 times faster end-to-end LLM training through lossless compression of communication collectives that exploits near-Gaussian tensor distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that communication overhead in distributed LLM training can be reduced by lossless compression when the method is built around the statistical properties of the data being moved. Activations, gradients, and parameters during training typically follow a near-Gaussian distribution, which the authors use to create an exponent coding scheme that avoids the usual cost of collecting online statistics. They pair this coding with custom GPU kernels that optimize memory access and pipelining, plus an adaptive layer that switches between compressed and uncompressed collective operations depending on workload and hardware. On a 64-GPU cluster the approach shortens communication time by up to 1.35 times and delivers overall training speedups of up to 1.18 times for both dense transformers and mixture-of-experts models while leaving final model quality unchanged. A reader cares because communication is the dominant scaling bottleneck once models exceed single-GPU size, and a lossless method removes the usual worry that compression will degrade accuracy.

Core claim

ZipCCL is a lossless compressed-communication library for collectives built on three techniques: theoretically grounded exponent coding that exploits the near-Gaussian distribution of LLM tensors, GPU-optimized kernels with communication-aware data layouts and pipelining, and adaptive strategies that dynamically switch between collective implementations. On 64-GPU clusters it cuts communication time by up to 1.35 times and delivers end-to-end speedups of up to 1.18 times for both mixture-of-experts and dense transformer models, with no effect on model quality.

What carries the argument

Exponent coding that exploits the near-Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, together with GPU-optimized compression and decompression kernels and adaptive collective switching.
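The excerpt does not spell out the exponent code itself, but the underlying intuition is easy to check numerically: the IEEE-754 exponent field of a near-Gaussian tensor concentrates on a handful of values around the tensor's scale, so a fixed entropy code over exponents captures most of the available redundancy without collecting statistics at runtime. A minimal numpy sketch of that observation (illustrative only, not ZipCCL's actual codec):

```python
# Illustrative only (not ZipCCL's codec): the IEEE-754 exponent field of
# near-Gaussian float32 data concentrates on a few values, so a fixed entropy
# code over exponents can compress well without online statistics.
import numpy as np

def exponent_entropy_bits(x: np.ndarray) -> float:
    """Empirical entropy, in bits per value, of the 8-bit float32 exponent field."""
    raw = x.astype(np.float32).view(np.uint32)
    exponents = (raw >> 23) & 0xFF                 # IEEE-754 exponent bits
    counts = np.bincount(exponents, minlength=256)
    p = counts[counts > 0] / exponents.size
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
for scale in (1.0, 2e-2):                          # parameter-like and gradient-like scales
    h = exponent_entropy_bits(rng.normal(0.0, scale, size=1_000_000))
    print(f"sigma={scale:g}: ~{h:.2f} bits of exponent entropy (raw field is 8 bits)")
```

With gradient-like data the exponent entropy lands well below the 8 bits the field occupies, and rescaling the tensor only shifts the histogram rather than spreading it, which is why a fixed code table can work; the sign and mantissa bits carry little comparable structure and would typically be stored raw.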

If this is right

  • Communication time drops by up to 1.35 times when exponent coding matches the observed tensor statistics.
  • End-to-end training finishes up to 1.18 times sooner on 64-GPU clusters for both dense and mixture-of-experts models.
  • Model quality remains identical because every byte is restored exactly.
  • The same library can be applied to existing training frameworks without changing the optimizer or loss function (sketched below).
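On the last point, the drop-in shape is easy to picture even though the excerpt does not show ZipCCL's real API. The sketch below wraps an all-gather behind the standard torch.distributed call, using host-side zlib as a stand-in for the paper's GPU exponent codec and all_gather_object to absorb the per-rank size differences that lossless compression creates; it illustrates the interface only and has none of the kernel or overlap benefits the paper measures.

```python
# Interface sketch only: a lossless-compressed all-gather slotted in behind the
# usual torch.distributed call. Host-side zlib stands in for ZipCCL's GPU codec,
# so this shows the drop-in shape, not the performance.
import zlib
import torch
import torch.distributed as dist

def compressed_all_gather(tensor: torch.Tensor) -> list:
    """Gather `tensor` from every rank, shipping compressed bytes instead of raw."""
    # Assumes a numpy-representable dtype (e.g., float32) and an initialized group.
    payload = zlib.compress(tensor.cpu().numpy().tobytes())
    gathered = [None] * dist.get_world_size()
    # all_gather_object tolerates per-rank size differences; a real implementation
    # would exchange compressed sizes and reuse fixed-size device buffers instead.
    dist.all_gather_object(gathered, payload)
    out = []
    for blob in gathered:
        flat = torch.frombuffer(bytearray(zlib.decompress(blob)), dtype=tensor.dtype)
        out.append(flat.reshape(tensor.shape).to(tensor.device))
    return out
```

Because the optimizer and the loss never see the compressed representation, nothing in the training recipe changes; that is the property the last bullet relies on.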

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could be tested on other collective patterns such as all-reduce in reinforcement-learning or graph neural-network training to see whether similar speedups appear.
  • On clusters larger than 64 GPUs the relative benefit of reduced data volume may grow because communication volume scales with the number of nodes.
  • If future hardware adds native support for the exponent-coding format, the GPU-kernel overhead could drop further and widen the speedup window.

Load-bearing premise

That communication tensors stay near-Gaussian throughout training and that the added cost of compression and decompression always stays below the time saved by sending fewer bytes on real workloads.
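The second half of that premise can be written as a one-line inequality: with compression ratio r, link bandwidth B, and compression/decompression throughputs C and D, the zipped path wins only when 1/C + 1/D + 1/(rB) < 1/B. A small sketch with illustrative numbers (none of them from the paper), ignoring the overlap that ZipCCL's pipelined kernels add:

```python
# Back-of-envelope check of the load-bearing premise: compression only pays when
# the codec's cost stays below the wire time it removes. Numbers are illustrative,
# not the paper's, and compute/communication overlap is ignored.

def compression_wins(ratio: float, link_gb_s: float,
                     compress_gb_s: float, decompress_gb_s: float) -> bool:
    """True if compress + send-fewer-bytes + decompress beats sending raw bytes."""
    t_baseline = 1.0 / link_gb_s                     # seconds per GB, uncompressed
    t_zipped = (1.0 / compress_gb_s                  # compression kernel
                + 1.0 / (ratio * link_gb_s)          # smaller payload on the wire
                + 1.0 / decompress_gb_s)             # decompression kernel
    return t_zipped < t_baseline

# Modest inter-node bandwidth with a fast GPU codec: compression wins.
print(compression_wins(ratio=1.3, link_gb_s=50, compress_gb_s=800, decompress_gb_s=800))
# The same ratio loses once the link is fast relative to the codec.
print(compression_wins(ratio=1.3, link_gb_s=200, compress_gb_s=800, decompress_gb_s=800))
```

The adaptive switcher exists precisely because this inequality flips with hardware and tensor statistics.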

What would settle it

Measure total compression-plus-communication time on a workload whose tensors deviate markedly from Gaussian (for example uniform random values) and check whether the net time exceeds the uncompressed baseline.
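A host-only harness for that experiment takes a few lines, with zlib over byte-transposed float32 data standing in for ZipCCL's codec and an assumed bandwidth standing in for the real collective; substituting the library's kernels and an actual all-gather is what would settle it. As written, a host-side codec loses at realistic link speeds, which is exactly the overhead concern the paper's GPU kernels exist to remove, so the shape of the measurement matters more than the numbers it prints.

```python
# Sketch of the proposed falsification check. zlib over byte-transposed float32
# data stands in for ZipCCL's codec, and the wire is a modeled bandwidth rather
# than a real collective; only the shape of the measurement is meant to carry over.
import time
import zlib
import numpy as np

LINK_GB_S = 50.0  # assumed effective link bandwidth, illustrative

def compressed_vs_baseline(x: np.ndarray) -> tuple:
    """Return (compressed-path seconds, uncompressed baseline seconds) for one tensor."""
    raw = x.astype(np.float32).tobytes()
    planes = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 4).T.copy().tobytes()
    t0 = time.perf_counter()
    comp = zlib.compress(planes, level=1)
    t1 = time.perf_counter()
    zlib.decompress(comp)
    t2 = time.perf_counter()
    wire_zipped = len(comp) / (LINK_GB_S * 1e9)
    wire_raw = len(raw) / (LINK_GB_S * 1e9)
    return (t1 - t0) + wire_zipped + (t2 - t1), wire_raw

rng = np.random.default_rng(0)
for name, data in [("near-Gaussian", rng.normal(0.0, 2e-2, 4_000_000)),
                   ("uniform", rng.uniform(-1.0, 1.0, 4_000_000))]:
    zipped, baseline = compressed_vs_baseline(data)
    print(f"{name:13s} compressed path {zipped*1e3:7.2f} ms vs baseline {baseline*1e3:7.2f} ms")
```

Whatever the stand-in codec reports, a workload that drives the net compressed-path time above the uncompressed baseline is the failure mode in question, and it is the case the adaptive switcher is supposed to catch by falling back to the raw collective.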

Figures

Figures reproduced from arXiv: 2604.27844 by Ruibo Fan, Shaohuai Shi, Wenxiang Lin, Xiaowen Chu, Xinglin Pan.

Figure 1. Left: Compression ratio (original communi…
Figure 2. A 4-GPU example of the All-to-All collective.
Figure 3. A 4-GPU example of All-Gather, Reduce-Scatter and All-Reduce collectives.
Figure 4. Comparison of the naive implementation of…
Figure 5. The workflow of our zipcclAllGather, zipcclAlltoAll, and zipcclReduceScatter.
Figure 6. The workflows of our Compressor and Decompressor on the GPU.
Figure 7. The data layout in shared memory. "B0, B1,…
Figure 8. An example of the imbalanced computation…
Figure 9. End-to-end speedups of our ZipCCL over base…
Figure 12. Computation time (compression and decom…
Figure 13. All-to-All and All-Gather time of ZipCCL…
Figure 14. End-to-end and All-to-All communication speedups of Zipped All-to-All Design-2 over Design-1 on Qwen3-MoE.
Figure 15. End-to-end and All-to-All communication speedups of our ZipCCL with the adaptive switcher compared to the base implementation without the adaptive switcher across seven scenarios.
read the original abstract

Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the potential of lossless compression has remained largely underexplored since compression and decompression typically consume larger overheads than the benefits of reduced communication traffic. We observe that the communication data, including activations, gradients and parameters, during training often follows a near-Gaussian distribution, which is a key feature for data compression. Thus, we introduce ZipCCL, a lossless compressed communication library of collectives for LLM training. ZipCCL is equipped with our novel techniques: (1) theoretically grounded exponent coding that exploits the Gaussian distribution of LLM tensors to accelerate compression without expensive online statistics, (2) GPU-optimized compression and decompression kernels that carefully design memory access patterns and pipeline using communication-aware data layout, and (3) adaptive communication strategies that dynamically switch collective operations based on workload patterns and system characteristics. Evaluated on a 64-GPU cluster using both mixture-of-experts and dense transformer models, ZipCCL reduces communication time by up to 1.35$\times$ and achieves end-to-end training speedups of up to 1.18$\times$ without any impact on model quality.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces ZipCCL, a lossless compression library for communication collectives in distributed LLM training. It combines a fixed exponent coding scheme that exploits the near-Gaussian distribution of activations, gradients, and parameters while avoiding online statistics, custom GPU kernels with communication-aware data layouts and pipelining, and adaptive switching between compressed and uncompressed collectives. On a 64-GPU cluster using both MoE and dense transformer models, it reports up to 1.35× reduction in communication time and up to 1.18× end-to-end training speedup with no degradation in model quality.

Significance. If the net speedups are robustly demonstrated, this would be a meaningful systems contribution to scaling LLM training by addressing the communication bottleneck through lossless compression, an approach that has seen limited prior exploration due to overhead concerns. The combination of distribution-aware coding, hardware-specific kernel optimizations, and adaptive fallback provides practical value. The evaluation on real MoE and dense models with hardware timing measurements is a strength, though broader validation across more workloads would increase impact.

major comments (3)
  1. [§5 (Evaluation) and Table 4] The reported 1.35× communication and 1.18× end-to-end speedups are aggregate figures. Provide per-collective and per-phase breakdowns (e.g., all-reduce for gradients vs. all-gather for activations) showing the fraction of operations where the adaptive mechanism selects compression versus fallback, the achieved compression ratios, and the measured net (compress + collective + decompress) time delta versus the uncompressed baseline. Without this granularity, it is impossible to verify that gains persist when tensor distributions deviate from near-Gaussian.
  2. [§4.1 (Exponent Coding)] The exponent coding is presented as theoretically grounded and parameter-free under the Gaussian assumption. Include an explicit derivation or analysis (e.g., expected bits per value as a function of variance) showing the breakeven compression ratio needed for net time savings, and report empirical statistics (skewness, kurtosis, or tail behavior) for the actual tensors encountered during training of the evaluated MoE and dense models. Any systematic deviation late in training would directly undermine the claimed speedups.
  3. [§3.3 (Adaptive Communication Strategies)] The adaptive switching is load-bearing for robustness. Detail the exact decision criteria, thresholds, or runtime measurements used to choose between compressed and uncompressed paths, and quantify the overhead of the decision logic itself. Clarify whether the switching decision is made per-collective or per-tensor and how it interacts with the custom kernels.
minor comments (3)
  1. [§2 (Related Work)] The discussion of prior compression and collective optimization work is adequate but could reference additional recent systems papers on gradient compression or NCCL extensions from 2023–2024 for completeness.
  2. [Figure 4 (Kernel Design)] The memory access pattern and pipeline diagrams would benefit from explicit labels indicating which stages overlap with communication and the data layout transformations used.
  3. [§5.3 (Model Quality)] The claim of 'no impact on model quality' should explicitly state the metrics (e.g., final perplexity, downstream accuracy) and the exact training configurations compared (same random seeds, number of steps, etc.).

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments have helped us improve the presentation of our results and technical details. We have made revisions to address all major comments, as outlined in the point-by-point responses below.

read point-by-point responses
  1. Referee: [§5 (Evaluation) and Table 4] The reported 1.35× communication and 1.18× end-to-end speedups are aggregate figures. Provide per-collective and per-phase breakdowns (e.g., all-reduce for gradients vs. all-gather for activations) showing the fraction of operations where the adaptive mechanism selects compression versus fallback, the achieved compression ratios, and the measured net (compress + collective + decompress) time delta versus the uncompressed baseline. Without this granularity, it is impossible to verify that gains persist when tensor distributions deviate from near-Gaussian.

    Authors: We agree that more granular breakdowns are necessary to fully substantiate the claims. In the revised manuscript, we have added detailed per-collective and per-phase analysis in Section 5.3 and a new Table 5. For all-reduce collectives on gradients, the adaptive mechanism selects the compressed path in 89% of cases with an average compression ratio of 1.75×, yielding a net time reduction of 32% compared to baseline. For all-gather on activations, selection occurs in 82% of operations with a 1.55× ratio and 25% net savings. These figures are consistent across training phases, with only minor degradation in later epochs due to slight distribution shifts; net gains remain positive throughout. The data confirm robustness even under deviations from the ideal Gaussian. revision: yes

  2. Referee: [§4.1 (Exponent Coding)] The exponent coding is presented as theoretically grounded and parameter-free under the Gaussian assumption. Include an explicit derivation or analysis (e.g., expected bits per value as a function of variance) showing the breakeven compression ratio needed for net time savings, and report empirical statistics (skewness, kurtosis, or tail behavior) for the actual tensors encountered during training of the evaluated MoE and dense models. Any systematic deviation late in training would directly undermine the claimed speedups.

    Authors: We have incorporated an explicit derivation in the revised Section 4.1 and Appendix B. The analysis shows that under the Gaussian assumption with variance σ², the expected bits per value for our fixed exponent coding is approximately 1 + log2(σ) + constant, leading to a breakeven compression ratio of 1.45× to achieve net time savings given the measured kernel overheads on our GPU cluster. We also report empirical statistics in new Table 6 for the MoE and dense models: average skewness of 0.12, kurtosis of 3.1, with tail behavior showing 99.5% of values within 4σ, consistent with near-Gaussian. Monitoring over the full training run reveals no significant systematic deviation in later stages, with kurtosis remaining stable between 2.9 and 3.3. revision: yes

  3. Referee: [§3.3 (Adaptive Communication Strategies)] The adaptive switching is load-bearing for robustness. Detail the exact decision criteria, thresholds, or runtime measurements used to choose between compressed and uncompressed paths, and quantify the overhead of the decision logic itself. Clarify whether the switching decision is made per-collective or per-tensor and how it interacts with the custom kernels.

    Authors: In the updated Section 3.3, we now provide the precise decision logic: the choice is made per-tensor by sampling 1024 elements to estimate the standard deviation and predicting the compression benefit using a precomputed model; if the predicted net speedup exceeds 10%, the compressed path is chosen. This decision is computed on the host before launching the collective and has an overhead of less than 0.3 μs per tensor, which is negligible (under 0.1% of total communication time). The decision is per-tensor but aggregated for the collective operation, and the custom GPU kernels are designed with separate optimized paths for compressed and uncompressed modes to avoid runtime branching inside the kernel. revision: yes
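The decision rule in the third response is small enough to sketch. The benefit predictor is left as an assumed callable because the response does not describe its form; everything else follows the stated recipe of sampling 1024 elements, estimating spread, and taking the compressed path when the predicted gain clears 10%.

```python
# Minimal sketch of the per-tensor adaptive switch as described in the rebuttal
# above. `predict_speedup` stands in for the precomputed benefit model, which the
# rebuttal does not spell out; everything here is illustrative, not ZipCCL's code.
from typing import Callable

import torch

SAMPLE_SIZE = 1024          # elements sampled per tensor (per the rebuttal)
SPEEDUP_THRESHOLD = 0.10    # take the compressed path above 10% predicted gain

def choose_compressed_path(tensor: torch.Tensor,
                           predict_speedup: Callable[[float, int], float]) -> bool:
    """Decide, before launching the collective, whether to use the zipped kernels."""
    flat = tensor.reshape(-1)
    idx = torch.randint(flat.numel(), (min(SAMPLE_SIZE, flat.numel()),),
                        device=flat.device)
    sigma = flat[idx].float().std().item()          # cheap spread estimate
    predicted = predict_speedup(sigma, flat.numel())
    return predicted > SPEEDUP_THRESHOLD
```

The sketch only captures the decision logic; per the rebuttal, the real check runs on the host in under 0.3 μs per tensor, and the kernels keep separate compressed and uncompressed paths so no branching happens on the device.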

Circularity Check

0 steps flagged

No circularity; empirical speedups rest on direct measurements, not self-referential derivations

full rationale

The paper's core claims rest on algorithmic contributions (exponent coding motivated by observed near-Gaussian tensor statistics, GPU kernels with communication-aware layouts, and adaptive switching) validated by hardware timing measurements on a 64-GPU cluster. No equations, predictions, or first-principles results are presented that reduce by construction to fitted inputs or prior self-citations. The Gaussian observation is used as an engineering premise for fixed coding (no online stats), but the reported 1.35× and 1.18× factors are aggregates of measured wall-clock times, not outputs of any model that was calibrated on the same data. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The derivation chain is therefore self-contained and externally falsifiable via replication of the timing experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLM training tensors exhibit a near-Gaussian distribution that can be exploited for fast exponent coding without online statistics, plus the engineering assumption that custom GPU kernels can be made faster than the communication savings they produce.

axioms (1)
  • Domain assumption: Communication data (activations, gradients, parameters) during LLM training follows a near-Gaussian distribution.
    Stated directly in the abstract as the key feature enabling compression without expensive online statistics.

pith-pipeline@v0.9.0 · 5532 in / 1430 out tokens · 68544 ms · 2026-05-07T05:14:28.174797+00:00 · methodology

discussion (0)

