Quantize What Counts: More for Keys, Less for Values
Pith reviewed 2026-05-23 01:47 UTC · model grok-4.3
The pith
For any memory budget, giving more precision to attention keys than values reduces quantization error in Transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors prove that key matrices exhibit systematically larger norms than value matrices in Transformers, and that this geometric property implies that, under a fixed memory constraint, bit allocation favoring keys over values minimizes the overall quantization error and better preserves model performance.
What carries the argument
Two theorems linking the spectral and Frobenius norms of key and value projections to optimal mixed-precision bit allocation in the KV cache.
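The two norms named above can be computed as in the following minimal sketch. It is an illustration, not the authors' code: the matrices `w_k` and `w_v` are random stand-ins (hypothetical, with the key matrix deliberately given a larger scale), and the spectral norm is estimated by power iteration so that only the standard library is needed.

```python
import math
import random


def frobenius_norm(m):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))


def _matvec(m, v):
    """Compute M v for a row-major list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _matvec_t(m, u):
    """Compute M^T u."""
    return [sum(m[i][j] * u[i] for i in range(len(m))) for j in range(len(m[0]))]


def spectral_norm(m, iters=200, seed=0):
    """Largest singular value, estimated by power iteration on M^T M."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(m[0]))]
    for _ in range(iters):
        w = _matvec_t(m, _matvec(m, v))
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
    # v is now a unit vector near the top right-singular vector.
    return math.sqrt(sum(x * x for x in _matvec(m, v)))


# Hypothetical stand-ins for key and value matrices; the scales are assumptions.
rng = random.Random(1)
w_k = [[rng.gauss(0.0, 2.0) for _ in range(8)] for _ in range(16)]
w_v = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]

for name, m in (("key", w_k), ("value", w_v)):
    print(name, "spectral:", spectral_norm(m), "Frobenius:", frobenius_norm(m))
```

Applied to real weights or cached tensors, the same two functions would show whether the key-side norms are in fact larger for a given layer.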
If this is right
- Key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both).
- Memory is conserved without proportional accuracy loss across prominent LLMs and benchmarks.
- Bit allocation shifts from heuristic tuning to a geometry-driven principle.
- The method applies generally to efficient LLM inference under memory constraints.
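The memory side of these claims is simple arithmetic, sketched below. The model shape is hypothetical, and the function counts only the quantized payload: per-group scales and zero-points, which real schemes also store, are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_k, bits_v):
    """Payload size in bytes of a quantized KV cache for one sequence.

    Assumption: quantization metadata (scales, zero-points) is not counted.
    """
    entries = n_layers * n_kv_heads * head_dim * seq_len  # per tensor (K or V)
    return entries * (bits_k + bits_v) / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(**shape, bits_k=16, bits_v=16)
k4v4 = kv_cache_bytes(**shape, bits_k=4, bits_v=4)
k4v2 = kv_cache_bytes(**shape, bits_k=4, bits_v=2)
k2v4 = kv_cache_bytes(**shape, bits_k=2, bits_v=4)

for name, size in (("fp16", fp16), ("K4V4", k4v4), ("K4V2", k4v2), ("K2V4", k2v4)):
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under this accounting, 4-bit keys with 2-bit values and 2-bit keys with 4-bit values cost the same memory, so any accuracy difference between them comes from where the bits go, not how many there are.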
Where Pith is reading between the lines
- The norm-based priority might extend to other matrix pairs in neural networks where similar geometric disparities exist.
- Dynamic quantization schemes could monitor norm ratios per layer to adjust bits on the fly.
- This principle may interact with other compression techniques like pruning to yield compounded efficiency gains.
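The per-layer idea in the second bullet could look roughly like the sketch below. It is speculative in the same way the bullet is. The error proxy `norm² · 4^(-bits)` is an assumption borrowed from the standard uniform-quantization noise model, not a result from the paper, and `best_split` and the layer norms are hypothetical.

```python
def best_split(norm_k_sq, norm_v_sq, total_bits, min_bits=1):
    """Choose (bits_k, bits_v) with bits_k + bits_v == total_bits.

    Minimizes an assumed error proxy: squared norm times 4**(-bits),
    i.e. each extra bit is taken to cut squared error by a factor of four.
    """
    candidates = [
        (b, total_bits - b) for b in range(min_bits, total_bits - min_bits + 1)
    ]

    def proxy(split):
        bits_k, bits_v = split
        return norm_k_sq * 4.0 ** (-bits_k) + norm_v_sq * 4.0 ** (-bits_v)

    return min(candidates, key=proxy)


# Hypothetical per-layer squared norms (key, value).
layer_norms = [(9.0, 1.0), (4.0, 4.0), (1.0, 16.0)]

for layer, (nk, nv) in enumerate(layer_norms):
    print(layer, best_split(nk, nv, total_bits=6))
```

With these made-up numbers the rule favors keys only in the layer where the key norm dominates, which is the behavior a norm-monitoring scheme would need.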
Load-bearing premise
Larger norms in key projections correspond to higher information density that merits priority in bit allocation.
What would settle it
A counterexample where allocating equal or more bits to values yields lower quantization error or higher accuracy than key-prioritized allocation under the same memory budget.
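A search for such a counterexample can be sketched as follows. The data is synthetic, with the key entries given a larger scale by assumption, and the quantizer is a plain per-tensor min-max uniform quantizer rather than whatever the paper uses. The script measures reconstruction error of the tensors themselves, not downstream accuracy.

```python
import random


def quantize(xs, bits):
    """Uniform min-max quantization to 2**bits levels, then dequantization."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return list(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]


def sq_error(xs, bits):
    """Total squared reconstruction error after quantizing xs."""
    return sum((x - q) ** 2 for x, q in zip(xs, quantize(xs, bits)))


# Synthetic stand-ins: key entries have three times the scale (an assumption).
rng = random.Random(0)
key_entries = [rng.gauss(0.0, 3.0) for _ in range(4096)]
value_entries = [rng.gauss(0.0, 1.0) for _ in range(4096)]


def total_error(bits_k, bits_v):
    """Summed squared error for one (bits_k, bits_v) allocation."""
    return sq_error(key_entries, bits_k) + sq_error(value_entries, bits_v)


for bits_k, bits_v in [(4, 2), (3, 3), (2, 4)]:
    print(f"K{bits_k}V{bits_v}: {total_error(bits_k, bits_v):.1f}")
```

Replacing the synthetic lists with key and value tensors captured from a real model, and the error with a task metric, would turn this into the test described above. A layer where the value-favored split wins at equal budget would count against the claim.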
Original abstract
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in Transformer attention, key projection matrices have systematically larger spectral and Frobenius norms than value matrices (Theorem 1), implying higher information density along the key path, and that for any fixed memory budget, allocating more bits to keys than values strictly reduces quantization error while better preserving accuracy (Theorem 2). Empirical results across LLMs show that key-favored allocations (e.g., 4-bit keys + 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both) while saving memory; source code is released.
Significance. If the central theorems hold, the work supplies a geometry-driven principle for KV-cache bit allocation that replaces heuristic tuning, with direct implications for memory-efficient inference. The release of source code strengthens reproducibility and allows independent verification of the empirical claims.
major comments (2)
- [Theorem 1] Theorem 1: the statement that larger key norms imply higher information density is asserted from the spectral/Frobenius norm comparison, yet no quantitative definition of information density nor any bound relating norm magnitude to quantization error under a given operator is supplied.
- [Theorem 2] Theorem 2: the claim that key-favored allocation strictly reduces quantization error for any memory budget is presented as following from Theorem 1, but the manuscript contains no derivation showing how the observed norm difference, when combined with a quantization operator and the memory constraint, produces a strictly lower error bound; the step appears to rely on an implicit assumption that error scales directly with matrix norm.
minor comments (1)
- The abstract reports a peak accuracy retention of 98.3% but does not name the specific models, tasks, or uniform-allocation baselines used to obtain that figure; adding this detail would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.
- Referee: [Theorem 1] Theorem 1: the statement that larger key norms imply higher information density is asserted from the spectral/Frobenius norm comparison, yet no quantitative definition of information density nor any bound relating norm magnitude to quantization error under a given operator is supplied.
Authors: We appreciate this observation. Theorem 1 provides a rigorous proof of the spectral and Frobenius norm inequalities between key and value projection matrices. The term 'information density' was used informally to convey that larger norms suggest greater sensitivity of the attention output to perturbations along the key path. We agree that this phrasing lacks a formal quantitative definition and that no explicit bound is derived linking the norm magnitude directly to quantization error. In the revised manuscript we will qualify or remove the term 'information density' and instead emphasize only the proven norm comparison together with its empirical consequences for bit allocation.
Revision: yes
- Referee: [Theorem 2] Theorem 2: the claim that key-favored allocation strictly reduces quantization error for any memory budget is presented as following from Theorem 1, but the manuscript contains no derivation showing how the observed norm difference, when combined with a quantization operator and the memory constraint, produces a strictly lower error bound; the step appears to rely on an implicit assumption that error scales directly with matrix norm.
Authors: We acknowledge the validity of this critique. The reasoning in Theorem 2 relies on the norm disparity established in Theorem 1 to argue that, under a fixed memory budget, higher precision for keys yields lower overall quantization error. However, the manuscript does not supply an explicit derivation that combines the norm difference with a specific quantization operator and the memory constraint to obtain a strict error bound. We will add a clarifying paragraph or short supporting argument in the revision that makes the connection between matrix scale (via norms) and quantization error more explicit, while retaining the empirical validation as the primary evidence for the practical benefit.
Revision: yes
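One way the missing link between matrix scale and quantization error might be written down is sketched here. It is not the paper's derivation. It assumes a b-bit uniform quantizer Q_b applied to an n × d matrix X whose entries lie in [-R_X, R_X], with rounding error modeled as uniform noise (the usual high-resolution approximation). The symbols R_K, R_V, h and l are introduced only for this illustration.

```latex
% Assumed model: b-bit uniform quantizer, entries of X in [-R_X, R_X],
% rounding error uniform on [-\Delta/2, \Delta/2].
\begin{align*}
\Delta_X(b) &= \frac{2R_X}{2^{b}-1}, \qquad
\mathbb{E}\,\lVert X - Q_b(X)\rVert_F^2 \approx nd\,\frac{\Delta_X(b)^2}{12}
  = \frac{nd\,R_X^2}{3\,(2^{b}-1)^2}, \\
E(b_K, b_V) &\approx \frac{nd}{3}\left[\frac{R_K^2}{(2^{b_K}-1)^2}
  + \frac{R_V^2}{(2^{b_V}-1)^2}\right], \\
E(l, h) - E(h, l) &= \frac{nd}{3}\,\bigl(R_K^2 - R_V^2\bigr)
  \left[\frac{1}{(2^{l}-1)^2} - \frac{1}{(2^{h}-1)^2}\right],
  \qquad h > l \ge 1.
\end{align*}
```

The bracket in the last line is positive whenever h > l, so giving the larger bit width h to keys yields the lower error exactly when R_K > R_V. This model ties error to dynamic range rather than to spectral or Frobenius norms, and it measures error in the cached tensors rather than in the attention output, so it illustrates the assumption the referee identifies without discharging it.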
Circularity Check
No significant circularity detected
The paper anchors its claims in two theorems derived from matrix properties of Transformer models: Theorem 1 observes systematically larger spectral and Frobenius norms for key projections versus value matrices as an intrinsic geometric fact, and Theorem 2 concludes that key-favored bit allocation under a fixed memory budget reduces quantization error. These steps are presented as following from first-principles geometry rather than from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or text in the provided abstract and summary reduce the target result to its own inputs by construction, and the derivation chain remains independent of the empirical outcomes it motivates.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Key projections have larger spectral and Frobenius norms than value matrices in Transformer models.
Forward citations
Cited by 5 Pith papers
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
- RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
RateQuant delivers optimal mixed-precision KV cache quantization by per-quantizer distortion fitting followed by closed-form reverse waterfilling, reducing perplexity by 70% versus KIVI at 2.5 average bits on Qwen3-8B.
- OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.