QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Amirkeivan Mohtashami; Bo Li; Dan Alistarh; James Hensman; Martin Jaggi; Maximilian L. Croci; Pashmina Cameron; Saleh Ashkboos; Torsten Hoefler

arxiv: 2404.00456 · v2 · pith:WABLRAORnew · submitted 2024-03-30 · 💻 cs.LG

keywords: quarot, llms, quantization, without, activations, bits, cache, hidden

We introduce QuaRot, a new Quantization scheme based on Rotations, which can quantize LLMs end-to-end, including all weights, activations, and KV cache, in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism, and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4 bits, without any channels identified for retention in higher precision. Our 4-bit quantized LLaMa2-70B model loses at most 0.47 WikiText-2 perplexity and retains 99% of the zero-shot performance. We also show that QuaRot can provide lossless 6- and 8-bit LLaMa2 models without any calibration data using round-to-nearest quantization. Code is available at: https://github.com/spcl/QuaRot.
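
The computational invariance described in the abstract can be illustrated on a single linear layer. The sketch below is not taken from the QuaRot code base. It assumes a plain normalized Hadamard matrix as the orthogonal rotation, a toy activation matrix with one hand-injected outlier channel, and symmetric per-row round-to-nearest quantization; the names hadamard and quantize_rtn are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def quantize_rtn(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-row round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal((8, d))               # toy activations (assumption)
x[:, 3] *= 50.0                               # inject one outlier channel (assumption)
w = rng.standard_normal((d, d)) / np.sqrt(d)  # toy weight matrix (assumption)

q = hadamard(d)        # orthogonal, so q @ q.T is the identity
x_rot = x @ q          # rotated activations
w_rot = q.T @ w        # the inverse rotation is folded into the weights

# Invariance: (x q)(q^T w) equals x w up to floating-point error.
y = x @ w
y_rot = x_rot @ w_rot
invariance_error = np.abs(y - y_rot).max()

# Relative 4-bit quantization error of the activations, before and after rotation.
err_plain = np.linalg.norm(quantize_rtn(x) - x) / np.linalg.norm(x)
err_rot = np.linalg.norm(quantize_rtn(x_rot) - x_rot) / np.linalg.norm(x_rot)

print(f"max |y - y_rot|        = {invariance_error:.2e}")
print(f"4-bit error, unrotated = {err_plain:.3f}")
print(f"4-bit error, rotated   = {err_rot:.3f}")
```

In this toy setting the rotation spreads the outlier channel's energy across all coordinates, so the per-row quantization scale is no longer dictated by a single large value. The layer output is unchanged because the rotation and its transpose cancel inside the matrix product.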

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 conditional novelty 8.0

HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
cs.RO 2026-06 unverdicted novelty 7.0

TISED framework reveals paradoxical effects where inference optimizations can lengthen task completion time on static tasks or raise success rates on dynamic tasks in embodied AI.
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
cs.LG 2026-06 unverdicted novelty 7.0

STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression a...
Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling
cs.CV 2026-05 unverdicted novelty 7.0

Omega-QVLA is a post-training quantization framework achieving uniform W4A4 for VLA models' LLM backbone and DiT action head via composite SVD-Hadamard rotation and per-step scaling, matching FP16 success rates on LIBERO.
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF 2026-05 unverdicted novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
cs.LG 2026-02 unverdicted novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memor...
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
cs.RO 2026-06 unverdicted novelty 6.0

TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation
cs.LG 2026-06 unverdicted novelty 6.0

GRINQH introduces a graded input-based quantization hierarchy that dynamically assigns multi-precision weights using activation magnitudes as importance proxy, unifying quantization with sparsification to improve LLM ...
MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems
cs.IR 2026-06 unverdicted novelty 6.0

MonaVec provides a training-free 4-bit vector quantization and deterministic search kernel using Randomized Hadamard Transform and ChaCha20 seeding for embedded and offline use.
Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference
cs.LG 2026-06 unverdicted novelty 6.0

Qift defines a fixed no-zero W2 level set for rotated weights that improves W2A4 perplexity and accuracy on LLaMA-2-7B and LLaMA-3.1-8B over the standard {-2,-1,0,1} set.
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 unverdicted novelty 6.0

HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization
cs.LG 2026-04 unverdicted novelty 6.0

ARHQ isolates error-sensitive weight directions in LLMs via truncated SVD on the scaled matrix W G_x^{1/2} from activation residuals, improving SNR and preserving performance under aggressive low-bit quantization.
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
cs.AR 2026-04 unverdicted novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
Rethinking Residual Errors in Compensation-based LLM Quantization
cs.LG 2026-04 conditional novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
cs.CL 2026-01 unverdicted novelty 6.0

Double achieves up to 5.3x inference speedup on 70B LLMs via synchronous double retrieval speculative parallelism that is lossless and outperforms trained baselines like EAGLE-3.
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
cs.CL 2025-11 conditional novelty 6.0

TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accur...
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
cs.LG 2025-04 unverdicted novelty 6.0

TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a fac...
SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices
cs.CL 2026-06 unverdicted novelty 5.0

Learned diagonal scaling matrices optimized with activation-aware loss reduce effective rank in LLM weight matrices and yield competitive perplexity and zero-shot results versus prior SVD methods on Llama 3.1 8B and Qwen3-8B.
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
cs.CV 2026-05 unverdicted novelty 5.0

QVGGT uses per-block mixed-precision analysis, outlier token filtering with PCA compensation, and task-aware scale search to achieve near-lossless W4A16 quantization of VGGT with 3-4.9x memory savings and 2.8x speedup.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR 2026-05 unverdicted novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 unverdicted novelty 5.0

HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG 2026-05 accept novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
High-Rate Quantized Matrix Multiplication I
cs.IT 2026-01 unverdicted novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
cs.AR 2025-09 unverdicted novelty 5.0

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
cs.LG 2024-12 unverdicted novelty 5.0

MixLLM uses global output-feature importance to set mixed bit-widths for LLM quantization and adds two-step dequantization plus software pipelining for system efficiency.
Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 4.0

A WHT rotation plus per-coordinate activation-energy rescaling before auto-round quantization lowers WikiText-2 perplexity 15-58% versus vanilla auto-round at W2A16 on models from 135M to 1.5B parameters.
DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
cs.CV 2026-04 unverdicted novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...
A Survey on Efficient Inference for Large Language Models
cs.CL 2024-04 accept novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.