Understanding and overcoming the challenges of efficient transformer quantization
4 Pith papers cite this work.
citing papers explorer
- Massive Activations in Large Language Models
  Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
  MUXQ uses low-rank outlier decomposition to redistribute activation outliers, allowing mixed-to-uniform INT8 quantization of LLMs with lower perplexity than naive methods on GPT-2 models.
- $\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space
  Log_b Quant is an adjustable-base logarithmic quantization technique that outperforms tensor-wise asymmetric linear quantization at 4-bit precision on language model benchmarks while providing memory savings.