Quantize What Counts: More for Keys, Less for Values

Alan Luo; Mohsen Hariri; Qifan Wang; Shaochen Zhong; Tianyi Zhang; Vipin Chaudhary; Weicong Chen; Xia Hu; Xiaotian Han

arxiv: 2502.15075 · v3 · submitted 2025-02-20 · 💻 cs.LG

Pith reviewed 2026-05-23 01:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords KV-cache quantization · mixed-precision allocation · Transformer geometry · attention keys and values · LLM inference optimization · spectral norms · quantization error reduction

The pith

For any memory budget, giving more precision to attention keys than values reduces quantization error in Transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes two theorems about the intrinsic geometry of Transformer models to guide KV-cache quantization. Key projections have larger spectral and Frobenius norms than value matrices, suggesting higher information density in the key path. As a result, for a fixed memory budget, allocating more bits to keys than to values strictly lowers quantization error and maintains higher accuracy. This approach is validated empirically on multiple LLMs where key-favored bit allocations like 4-bit keys and 2-bit values retain up to 98.3 percent accuracy compared to uniform 4-bit settings while saving memory.
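
As a rough illustration of the two quantities being compared, the sketch below computes the Frobenius norm and an estimate of the spectral norm (the largest singular value, found by power iteration) for two small matrices. The matrices `W_k` and `W_v` are made-up stand-ins, not weights from any model in the paper, and the helper names are hypothetical.

```python
import math

def frobenius_norm(A):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in A for x in row))

def spectral_norm(A, iters=200):
    """Estimate the largest singular value by power iteration on A^T A."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform starting vector
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]            # A v
        AtAv = [sum(A[i][j] * Av[i] for i in range(n_rows))               # A^T (A v)
                for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in AtAv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in AtAv]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    return math.sqrt(sum(x * x for x in Av))  # ||A v|| for the converged unit vector v

# Toy stand-ins (assumed values, not real model weights).
W_k = [[2.0, 0.5], [0.0, 1.5]]
W_v = [[1.0, 0.2], [0.1, 0.8]]

for name, W in (("key", W_k), ("value", W_v)):
    print(name, round(frobenius_norm(W), 3), round(spectral_norm(W), 3))
```

On real checkpoints the same comparison would be run per layer and per head on the actual projection weights; whether the key norms come out larger there is the empirical question the paper addresses.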

Core claim

The authors prove that key matrices exhibit systematically larger norms than value matrices in Transformers, and that this geometric property implies that, under a fixed memory constraint, bit allocation favoring keys over values minimizes the overall quantization error and better preserves model performance.

What carries the argument

Two theorems linking the spectral and Frobenius norms of key and value projections to optimal mixed-precision bit allocation in the KV cache.

If this is right

Key-favored allocations (e.g., 4-bit keys, 2-bit values) achieve up to 98.3% accuracy retention versus uniform quantization.
Memory is conserved without proportional accuracy loss across prominent LLMs and benchmarks.
Bit allocation shifts from heuristic tuning to a geometry-driven principle.
The method applies generally to efficient LLM inference under memory constraints.
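
The memory side of these claims is simple arithmetic, sketched below. The function name and the model shape are assumptions chosen for illustration, and the count covers only the quantized key and value entries; real schemes also store per-group scales and zero-points, which are ignored here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   key_bits, value_bits, batch_size=1):
    """Payload size of a quantized KV cache in bytes (metadata overhead ignored)."""
    entries = batch_size * num_layers * num_kv_heads * head_dim * seq_len
    return entries * (key_bits + value_bits) / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

k4v4 = kv_cache_bytes(**shape, key_bits=4, value_bits=4)
k4v2 = kv_cache_bytes(**shape, key_bits=4, value_bits=2)
fp16 = kv_cache_bytes(**shape, key_bits=16, value_bits=16)

print(f"K4V4: {k4v4 / 2**20:.0f} MiB")
print(f"K4V2: {k4v2 / 2**20:.0f} MiB ({k4v2 / k4v4:.0%} of K4V4)")
print(f"FP16: {fp16 / 2**20:.0f} MiB")
```

Under these assumptions a 4-bit-key, 2-bit-value cache needs 6 bits per key/value pair instead of 8, so its payload is three quarters of the uniform 4-bit one regardless of the model shape.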

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The norm-based priority might extend to other matrix pairs in neural networks where similar geometric disparities exist.
Dynamic quantization schemes could monitor norm ratios per layer to adjust bits on the fly.
This principle may interact with other compression techniques like pruning to yield compounded efficiency gains.
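
To make the second extension concrete, here is a minimal sketch of a per-layer policy that turns a key/value norm ratio into a bit split. The rule (a bit gap of `round(log2(ratio))`), the function name, the clamping bounds, and the example norms are all hypothetical; neither the paper nor the editorial text specifies such a policy.

```python
import math

def split_bits(key_norm, value_norm, total_bits, min_bits=2, max_bits=8):
    """Split a per-layer bit budget between keys and values.

    Hypothetical rule: the key/value bit gap is round(log2(key_norm / value_norm)).
    Both widths are clamped to [min_bits, max_bits], so in edge cases the sum
    can differ from total_bits.
    """
    gap = round(math.log2(key_norm / value_norm))
    key_bits = (total_bits + gap) // 2
    key_bits = max(min_bits, min(max_bits, key_bits))
    value_bits = max(min_bits, min(max_bits, total_bits - key_bits))
    return key_bits, value_bits

# Made-up (key_norm, value_norm) pairs for three layers.
layer_norms = [(12.0, 3.0), (5.0, 4.8), (9.0, 4.0)]
for layer, (k_norm, v_norm) in enumerate(layer_norms):
    print(layer, split_bits(k_norm, v_norm, total_bits=6))
```

A runtime version would have to measure norms or ranges on cached tensors as they are produced, and would need to weigh the cost of that monitoring against the memory it saves.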

Load-bearing premise

Larger norms in key projections correspond to higher information density that merits priority in bit allocation.

What would settle it

A counterexample where allocating equal or more bits to values yields lower quantization error or higher accuracy than key-prioritized allocation under the same memory budget.
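
A test of this kind can be prototyped on synthetic data before running it on real models. The sketch below compares two allocations with the same budget (6 bits per key/value pair) using per-tensor min-max uniform quantization. Everything here is an assumption made for illustration: the quantizer, the Gaussian data, and in particular the choice to draw keys with four times the spread of values. It measures reconstruction error only, not attention-output error or task accuracy.

```python
import random

def quantize_dequantize(values, bits):
    """Per-tensor min-max uniform quantization followed by dequantization."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((x - lo) / step) * step for x in values]

def squared_error(values, bits):
    """Sum of squared reconstruction errors at the given bit width."""
    recon = quantize_dequantize(values, bits)
    return sum((x - r) ** 2 for x, r in zip(values, recon))

rng = random.Random(0)
n = 4096
# Synthetic tensors: keys have 4x the standard deviation of values (an assumption).
keys = [rng.gauss(0.0, 4.0) for _ in range(n)]
values = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Same budget, opposite priorities.
err_k4v2 = squared_error(keys, 4) + squared_error(values, 2)
err_k2v4 = squared_error(keys, 2) + squared_error(values, 4)
print(f"K4V2 total squared error: {err_k4v2:.1f}")
print(f"K2V4 total squared error: {err_k2v4:.1f}")
```

In this toy setup the outcome is driven entirely by which tensor has the larger spread, so swapping the two scales favors the opposite allocation. A genuine counterexample would have to come from real cached keys and values, evaluated on attention outputs or downstream accuracy.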

Original abstract

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is two theorems that use spectral norms to justify giving more bits to keys than values in KV quantization, supported by experiments.

The main takeaway is that the authors derive two theorems from the spectral properties of key and value matrices in Transformers, showing that keys tend to have larger norms and that this justifies allocating more bits to keys than values under a fixed memory budget for lower quantization error. What stands out as new is the formalization of this geometry-driven priority, which moves beyond the usual trial-and-error in KV cache quantization. The paper also reports experiments across prominent LLMs where configurations like 4-bit keys and 2-bit values retain up to 98.3% accuracy versus uniform allocations, while using less memory overall. Releasing the code is a plus for reproducibility. The work does a reasonable job of grounding the bit allocation in matrix norms rather than heuristics, and the empirical results align with the claimed direction.

On the soft spots, the connection in Theorem 1 from larger norms to higher information density, and then in Theorem 2 to strictly lower error, appears to rest on an assumption that error scales with the norm in the attention computation. The abstract does not include the full derivations, so it is not clear how tight the bounds are or whether the prioritization holds without additional assumptions about the quantization operator. The experiments support the idea but would benefit from more detail on baseline choices and whether the gains are consistent across different model scales or tasks.

This paper is aimed at practitioners and researchers focused on LLM inference optimization. Someone looking for practical quantization strategies could find the rule useful, and the theoretical angle adds some depth. It is worth sending to peer review because the core idea is testable and the results are concrete, even if the proofs require careful checking.

Referee Report

2 major / 1 minor

Summary. The paper claims that in Transformer attention, key projection matrices have systematically larger spectral and Frobenius norms than value matrices (Theorem 1), implying higher information density along the key path, and that for any fixed memory budget, allocating more bits to keys than values strictly reduces quantization error while better preserving accuracy (Theorem 2). Empirical results across LLMs show key-favored allocations (e.g., 4-bit keys + 2-bit values) retain up to 98.3% of baseline accuracy versus uniform quantization while saving memory; source code is released.

Significance. If the central theorems hold, the work supplies a geometry-driven principle for KV-cache bit allocation that replaces heuristic tuning, with direct implications for memory-efficient inference. The release of source code strengthens reproducibility and allows independent verification of the empirical claims.

major comments (2)

[Theorem 1] The statement that larger key norms imply higher information density is asserted from the spectral/Frobenius norm comparison, yet no quantitative definition of information density nor any bound relating norm magnitude to quantization error under a given operator is supplied.
[Theorem 2] The claim that key-favored allocation strictly reduces quantization error for any memory budget is presented as following from Theorem 1, but the manuscript contains no derivation showing how the observed norm difference, when combined with a quantization operator and the memory constraint, produces a strictly lower error bound; the step appears to rely on an implicit assumption that error scales directly with matrix norm.
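
For orientation, the kind of derivation the second comment asks for can be sketched with the textbook high-resolution model of a uniform quantizer. This is not the paper's proof. The symbols are assumptions introduced here: R_K and R_V are the dynamic ranges of the cached key and value tensors (not the norms of the projection matrices), b_K and b_V are the bit widths, B is the budget per key/value pair, keys and values have equal element counts, and bit widths are treated as continuous.

```latex
% Step size and per-element error of a uniform quantizer (high-resolution approximation)
\Delta_X = \frac{R_X}{2^{b_X}-1} \approx R_X\,2^{-b_X}, \qquad
\mathbb{E}\big[e_X^2\big] \approx \frac{\Delta_X^2}{12}, \qquad X \in \{K, V\}.

% Total error proxy under the budget b_K + b_V = B
E(b_K, b_V) \approx \frac{1}{12}\Big(R_K^2\,4^{-b_K} + R_V^2\,4^{-b_V}\Big).

% Stationarity in b_K (with b_V = B - b_K) equalizes the two error terms
\frac{\partial E}{\partial b_K} = 0
\;\Longrightarrow\; R_K^2\,4^{-b_K} = R_V^2\,4^{-b_V}
\;\Longrightarrow\; b_K - b_V = \log_2\frac{R_K}{R_V}.
```

Under this model the optimum gives keys more bits exactly when R_K > R_V. Closing the gap the referee identifies would additionally require relating the projection-matrix norms of Theorem 1 to these tensor ranges, and relating reconstruction error to error in the attention output.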

minor comments (1)

The abstract reports a peak accuracy retention of 98.3% but does not name the specific models, tasks, or uniform-allocation baselines used to obtain that figure; adding this detail would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.

Referee: [Theorem 1] The statement that larger key norms imply higher information density is asserted from the spectral/Frobenius norm comparison, yet no quantitative definition of information density nor any bound relating norm magnitude to quantization error under a given operator is supplied.

Authors: We appreciate this observation. Theorem 1 provides a rigorous proof of the spectral and Frobenius norm inequalities between key and value projection matrices. The term 'information density' was used informally to convey that larger norms suggest greater sensitivity of the attention output to perturbations along the key path. We agree that this phrasing lacks a formal quantitative definition and that no explicit bound is derived linking the norm magnitude directly to quantization error. In the revised manuscript we will qualify or remove the term 'information density' and instead emphasize only the proven norm comparison together with its empirical consequences for bit allocation.

Revision: yes

Referee: [Theorem 2] The claim that key-favored allocation strictly reduces quantization error for any memory budget is presented as following from Theorem 1, but the manuscript contains no derivation showing how the observed norm difference, when combined with a quantization operator and the memory constraint, produces a strictly lower error bound; the step appears to rely on an implicit assumption that error scales directly with matrix norm.

Authors: We acknowledge the validity of this critique. The reasoning in Theorem 2 relies on the norm disparity established in Theorem 1 to argue that, under a fixed memory budget, higher precision for keys yields lower overall quantization error. However, the manuscript does not supply an explicit derivation that combines the norm difference with a specific quantization operator and the memory constraint to obtain a strict error bound. We will add a clarifying paragraph or short supporting argument in the revision that makes the connection between matrix scale (via norms) and quantization error more explicit, while retaining the empirical validation as the primary evidence for the practical benefit.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper anchors its claims in two theorems derived from matrix properties of Transformer models: Theorem 1 observes systematically larger spectral and Frobenius norms for key projections versus value matrices as an intrinsic geometric fact, and Theorem 2 concludes that key-favored bit allocation under a fixed memory budget reduces quantization error. These steps are presented as following from first-principles geometry rather than from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or text in the provided abstract and summary reduce the target result to its own inputs by construction, and the derivation chain remains independent of the empirical outcomes it motivates.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the two theorems about matrix norms in the attention mechanism.

axioms (1)

Domain assumption: key projections have larger spectral and Frobenius norms than value matrices in Transformer models.
This is the first theorem stated in the abstract as anchoring the quantization.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05 · conditional · novelty 8.0

HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
cs.LG · 2026-04 · conditional · novelty 7.0

RateQuant delivers optimal mixed-precision KV cache quantization by per-quantizer distortion fitting followed by closed-form reverse waterfilling, reducing perplexity by 70% versus KIVI at 2.5 average bits on Qwen3-8B.

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
cs.LG · 2026-05 · unverdicted · novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05 · unverdicted · novelty 6.0

HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A²-weighted value policy.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG · 2026-05 · unverdicted · novelty 5.0

HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.