TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
Abstract
Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants, but is often bottlenecked by network communication, particularly under pipeline-parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided adaptive bit allocation across tiles for optimal bit usage, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. Compared with token-level allocation, the tile-wise allocator assigns precision at the granularity of small channel windows within each token, reducing quantization error under the same bit budget. We prove that pipeline-parallel training equipped with TAH-Quant maintains a convergence rate of O(1/√T), matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant aggressively quantizes activations to 3-4 bits, providing up to 4.3x throughput speedup over uncompressed FP32 and up to 1.33x wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.
Forward citations
Cited by 3 Pith papers
- NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
  NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
- TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
  TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy.
- Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
  A retrospective survey and empirical evaluation of deep learning optimization algorithms that identifies trends, design trade-offs, and future directions.
Discussion (0)