TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
Abstract
Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants, but is often bottlenecked by network communication, particularly under pipeline-parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided adaptive bit allocation across tiles for optimal bit usage, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. Compared with token-level allocation, the tile-wise allocator assigns precision at the granularity of small channel windows within each token, reducing quantization error under the same bit budget. We prove that pipeline-parallel training equipped with TAH-Quant maintains a convergence rate of O(1/√T), matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant aggressively quantizes activations to 3-4 bits, providing up to 4.3x throughput speedup over uncompressed FP32 and up to 1.33x wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.
Forward citations
Cited by 3 Pith papers
- NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
  NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
- TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
  TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy.
- Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
  A retrospective survey and empirical evaluation of deep learning optimization algorithms that identifies trends, design trade-offs, and future directions.
Discussion (0)