Quantization Dominates Rank Reduction for KV-Cache Compression
Pith reviewed 2026-05-10 15:39 UTC · model grok-4.3
The pith
Quantization of the KV cache outperforms rank reduction because it avoids flipping which token the softmax attends to.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At identical storage budgets, quantization consistently outperforms rank reduction by 4 to 364 perplexity points. The gap remains even when the two techniques are combined, and it widens under grouped-query attention. A perturbation analysis shows that the damage from projection exceeds the damage from b-bit quantization by a factor of 3 × 2^(2b) per direction when measured under the softmax Fisher metric. Joint 4-bit quantization of keys and values delivers a 75 percent KV-cache reduction while adding only 0.18 perplexity on Mistral 7B; the advantage is independent of the basis chosen for the rank reduction.
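To make the storage matching concrete, here is a minimal budget-arithmetic sketch, assuming a head dimension of 128 and an FP16 baseline (the head dimension is not stated in this excerpt; 128 is typical for the models named). Under that assumption, a rank-32 FP16 cache and a full-dimension INT4 cache occupy the same number of bits per cached vector, which is why the abstract pairs them.

# Hypothetical budget check: bits per cached key/value vector per head.
# Assumes head_dim = 128 (not stated in the excerpt) and an FP16 baseline.
head_dim = 128
fp16_baseline = head_dim * 16           # 2048 bits
rank32_fp16 = 32 * 16                   # rank reduction: keep 32 dims at FP16
full_dim_int4 = head_dim * 4            # quantization: keep all 128 dims at 4 bits
print(fp16_baseline, rank32_fp16, full_dim_int4)   # 2048 512 512
# Both compressed variants use 25% of the baseline, i.e. a 75% reduction.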
What carries the argument
The structural asymmetry under softmax attention routing, formalized by a perturbation bound showing that projection damage exceeds quantization damage by 3 × 2^(2b) per direction under the softmax Fisher metric.
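The review does not reproduce the paper's derivation, but one standard error model yields a factor of exactly this form: for a coordinate bounded by |x| <= M, zeroing it (projection) can cost up to M^2 in squared score perturbation, while uniform b-bit quantization over [-M, M] has step Delta = 2M / 2^b and mean-squared error of about Delta^2 / 12 = M^2 / (3 × 2^(2b)), so the ratio is 3 × 2^(2b). The sketch below only checks that arithmetic under those assumptions; it is not the paper's proof, which is stated under the softmax Fisher metric.

# Worst-case projection error vs. uniform-quantization MSE for one bounded coordinate.
# Assumes |x| <= M and the standard Delta^2 / 12 noise model; not the paper's derivation.
M = 1.0
for b in (2, 4, 8):
    delta = 2 * M / 2 ** b              # quantization step over [-M, M]
    quant_mse = delta ** 2 / 12         # mean-squared quantization error
    proj_worst = M ** 2                 # error from removing the coordinate entirely
    print(b, proj_worst / quant_mse, 3 * 2 ** (2 * b))   # the last two columns match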
If this is right
- INT4 quantization matches FP16 perplexity on LAMBADA while rank-32 at the same storage budget collapses performance.
- The advantage of quantization over rank reduction grows as grouped-query attention becomes more aggressive.
- Hybrid baselines that combine rank reduction with quantization still underperform pure quantization.
- Joint 4-bit quantization of keys and values together achieves 75 percent total KV reduction with only +0.18 PPL on Mistral 7B.
Where Pith is reading between the lines
- Engineers building long-context inference systems should default to quantization rather than low-rank approximations for KV cache compression.
- The ordering-preservation argument may apply to compression of other softmax-based routing layers in neural networks.
- Future method design could focus on error models that explicitly bound changes to attention argmax rather than on coordinate-system choice; a minimal sketch of such a criterion follows this list.
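As a hedged illustration of what such an error model could look like (not taken from the paper): if compression moves every pre-softmax score by at most eps, the attended token cannot change whenever the gap between the top two scores exceeds 2 × eps.

import numpy as np

def argmax_safe(scores, eps):
    # Sufficient (not necessary) condition: a per-score perturbation bound of eps
    # cannot flip the top token if the top-1 / top-2 gap exceeds 2 * eps.
    top_two = np.sort(scores)[-2:]
    return bool(top_two[1] - top_two[0] > 2 * eps)

scores = np.array([4.1, 3.9, 1.0, 0.5])            # made-up scores for four tokens
print(argmax_safe(scores, eps=0.05))               # True: gap 0.2 > 0.1
print(argmax_safe(scores, eps=0.15))               # False: gap 0.2 <= 0.3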
Load-bearing premise
The structural asymmetry under softmax attention and the perturbation bound generalize beyond the five tested models and the LAMBADA task.
What would settle it
A controlled experiment on a new model or dataset in which rank reduction at a given storage budget produces lower perplexity than quantization at the same budget would falsify the claimed dominance.
Original abstract
We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%. We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a perturbation result showing projection damage exceeds quantization damage by 3 x 2^(2b) per direction under the softmax Fisher metric. A basis ablation confirms the finding is basis-independent (spread <0.4 PPL), establishing that the advantage comes from preserving dimensions, not from a better coordinate system. Joint K+V INT4 quantization achieves 75% total KV reduction at only +0.18 PPL on Mistral 7B.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that quantization outperforms rank reduction for KV-cache compression in transformers at matched storage budgets. Across five models (124M-14B parameters, MHA and GQA), quantization yields consistent perplexity advantages of 4-364 on LAMBADA, with the gap persisting in hybrid baselines and increasing with GQA aggressiveness. INT4 quantization nearly matches FP16 accuracy while rank-32 at equal budget collapses performance. The authors attribute the gap to a structural asymmetry under softmax attention: dimension removal can discretely flip argmax tokens, whereas quantization noise is bounded and preserves ordering. This is formalized via a perturbation result under the softmax Fisher metric showing projection damage exceeds quantization damage by 3 × 2^(2b) per direction. A basis ablation (<0.4 PPL spread) confirms the advantage stems from preserving dimensions rather than coordinate choice. Joint K+V INT4 achieves 75% KV reduction at +0.18 PPL on Mistral 7B.
Significance. If the empirical gaps and perturbation analysis hold, the result is significant for LLM inference efficiency, indicating that low-precision retention of all dimensions is preferable to rank reduction for KV caches. Strengths include consistent results across multiple model sizes and architectures, hybrid baselines, and the basis ablation establishing dimension preservation as the key factor. The proposed mechanism offers a plausible explanation for why quantization is more robust under softmax routing. However, the scope limited to LAMBADA and the unverified details of the formal derivation reduce the immediate impact; broader validation would elevate the contribution to practical guidance for compression techniques.
major comments (3)
- [theoretical analysis] The perturbation result (abstract and theoretical section) is load-bearing for the explanatory claim that projection damage exceeds quantization damage by 3 × 2^(2b) under the softmax Fisher metric. The abstract-only presentation leaves the full derivation, assumptions on attention score distributions, and any error bounds unverified; the manuscript must include the step-by-step derivation and a check against the empirical attention scores from the tested models to confirm it is not circular with the LAMBADA results.
- [Experiments] §Experiments (LAMBADA results): the 4-364 PPL gaps and their growth with GQA are reported on a single task focused on long-range last-token prediction. This distribution may amplify dimension sensitivity in ways not representative of other tasks or data; to support the general claim that quantization dominates rank reduction, evaluation on additional benchmarks (e.g., short-context or diverse distributions) is required, as the current scope limits extrapolation to arbitrary scales and tasks.
- [Experiments] Hybrid baseline results (abstract): while the gap persists when combining rank reduction with quantization, the specific compression ratios, exact PPL values per model, and whether the hybrid matches the pure quantization budget exactly are not detailed enough to rule out confounding factors in the storage matching; a table with per-model, per-level breakdowns including the hybrid would strengthen the claim that the asymmetry is structural rather than implementation-specific.
minor comments (2)
- [Abstract] The notation for the factor 3 × 2^(2b) should be defined explicitly in the main text (what b represents, e.g., bits) with a reference to the relevant equation, as the abstract leaves it ambiguous for readers.
- [Experiments] The basis ablation reports <0.4 PPL spread but does not specify the exact bases tested or the number of trials; adding this detail would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and expansions.
Point-by-point responses
Referee: [theoretical analysis] The perturbation result (abstract and theoretical section) is load-bearing for the explanatory claim that projection damage exceeds quantization damage by 3 × 2^(2b) under the softmax Fisher metric. The abstract-only presentation leaves the full derivation, assumptions on attention score distributions, and any error bounds unverified; the manuscript must include the step-by-step derivation and a check against the empirical attention scores from the tested models to confirm it is not circular with the LAMBADA results.
Authors: We agree that a complete, self-contained derivation is necessary to substantiate the perturbation analysis. The revised manuscript will include the full step-by-step derivation of the result under the softmax Fisher metric, with explicit statements of all assumptions on attention score distributions and the associated error bounds. We will also add a new subsection that computes the theoretical damage predictions and directly compares them to empirical attention score perturbations measured on the same models and LAMBADA data, confirming independence from the main experimental results. revision: yes
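As an indication of what such a check could look like, here is a hedged sketch with hypothetical helper names; it uses plain squared score change rather than the paper's softmax Fisher metric, and assumes access to cached key matrices K and queries q from the evaluated models. For each direction, it compares the damage from zeroing that coordinate against the damage from quantizing it to b bits, the ratio to be set against 3 × 2^(2b).

import numpy as np

def score_mse(K, K_hat, q):
    # Mean squared change in pre-softmax attention scores for one query.
    return float(np.mean((K_hat @ q - K @ q) ** 2))

def quantize_column(col, bits):
    # Simple uniform quantizer for one coordinate; real KV-cache quantizers differ.
    scale = np.abs(col).max() / (2 ** (bits - 1) - 1)
    return np.round(col / scale) * scale

def per_direction_ratio(K, q, i, bits=4):
    # Assumes the quantization error for direction i is nonzero.
    K_drop = K.copy(); K_drop[:, i] = 0.0                       # projection damage
    K_quant = K.copy(); K_quant[:, i] = quantize_column(K[:, i], bits)
    return score_mse(K, K_drop, q) / score_mse(K, K_quant, q)   # compare to 3 * 2**(2*bits)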
Referee: [Experiments] §Experiments (LAMBADA results): the 4-364 PPL gaps and their growth with GQA are reported on a single task focused on long-range last-token prediction. This distribution may amplify dimension sensitivity in ways not representative of other tasks or data; to support the general claim that quantization dominates rank reduction, evaluation on additional benchmarks (e.g., short-context or diverse distributions) is required, as the current scope limits extrapolation to arbitrary scales and tasks.
Authors: We acknowledge that LAMBADA's emphasis on long-range dependencies may not fully represent all regimes. In the revision we will add perplexity results on WikiText-103 and zero-shot accuracy on a short-context benchmark (e.g., PIQA) for the same models and compression settings. These additional evaluations will test whether the quantization advantage persists under different context lengths and task distributions. revision: yes
Referee: [Experiments] Hybrid baseline results (abstract): while the gap persists when combining rank reduction with quantization, the specific compression ratios, exact PPL values per model, and whether the hybrid matches the pure quantization budget exactly are not detailed enough to rule out confounding factors in the storage matching; a table with per-model, per-level breakdowns including the hybrid would strengthen the claim that the asymmetry is structural rather than implementation-specific.
Authors: We agree that the hybrid results require more granular reporting to eliminate any ambiguity about budget matching. The revised manuscript will include a new table in the experiments section that reports, for every model and compression level, the exact storage budget, the PPL for pure quantization, pure rank reduction, and the hybrid combination, together with a column confirming that the hybrid budget equals the pure-quantization budget. This will make the structural nature of the gap explicit. revision: yes
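For illustration only (the actual ratios and PPL values are for the paper to report), such a table could be indexed by configurations that share one bit budget per cached vector. Assuming a head dimension of 128, the pure-INT4 budget admits, for example, rank-32 FP16, a rank-64 INT8 hybrid, and full-dimension INT4 as exactly matched rows.

# Enumerate (rank, bits) settings that match the full-dimension INT4 budget.
# Assumes head_dim = 128; hybrid rows are rank reduction followed by quantization.
head_dim = 128
budget_bits = head_dim * 4                       # 512 bits per cached vector
candidates = [(r, b) for r in (16, 32, 64, 128) for b in (4, 8, 16)]
matched = [(r, b) for r, b in candidates if r * b == budget_bits]
print(matched)                                   # [(32, 16), (64, 8), (128, 4)]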
Circularity Check
No circularity: empirical results and perturbation bound are independently derived
full rationale
The paper's core claim rests on direct experimental comparisons of quantization vs. rank reduction across five models on LAMBADA, with hybrid baselines and basis ablations. The structural asymmetry is formalized as a perturbation bound (projection damage exceeding quantization damage by 3 × 2^(2b) under the softmax Fisher metric), presented as a first-principles derivation from attention routing properties rather than as a fit to the reported PPL numbers or a self-citation chain. No load-bearing step reduces by construction to its own inputs, renames fitted parameters as predictions, or relies on author-overlapping citations; the derivation chain remains self-contained and is evaluated against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- matched storage budget
axioms (1)
- domain assumption: softmax attention routing depends on preserving relative score ordering among tokens (illustrated in the toy sketch below)
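A toy illustration of this assumption with hand-picked numbers rather than real model activations: dropping a single key dimension flips which token receives the highest attention score, while 4-bit-style uniform quantization leaves the ordering intact.

import numpy as np

q = np.array([3.0, 0.1, 0.1])                    # query
K = np.array([[1.0, 0.0, 0.0],                   # token 0: its score lives in dim 0
              [0.0, 10.0, 10.0]])                # token 1: large in dims 1 and 2

def top_token(K_hat):
    return int(np.argmax(K_hat @ q))             # token with the highest score

def quantize(K_hat, bits=4):
    # Per-row uniform quantizer; a simplified stand-in for real KV-cache quantizers.
    scale = np.abs(K_hat).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(K_hat / scale) * scale

K_drop = K.copy(); K_drop[:, 0] = 0.0            # rank reduction: discard dimension 0
print(top_token(K), top_token(quantize(K)), top_token(K_drop))   # 0 0 1: only the drop flips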
Reference graph
Works this paper leans on
- [1] Ainslie, J., et al. GQA. arXiv:2305.13245, 2023.
- [2] Frantar, E., et al. GPTQ. arXiv:2210.17323, 2022.
- [3] Hsu, Y.-C., et al. FWSVD. ICLR, 2022.
- [4]
- [5]
- [6] Lin, J., et al. AWQ. arXiv:2306.00978, 2023.
- [7] Liu, Z., et al. Scissorhands. NeurIPS, 2023.
- [8] Liu, Z., et al. KIVI. arXiv:2402.02750, 2024.
- [9] Merity, S., et al. Pointer Sentinel Mixture Models. arXiv:1609.07843, 2016.
- [10] Paperno, D., et al. LAMBADA. ACL, 2016.
- [11] Yang, L., et al. MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection. arXiv:2410.14731, 2024.
- [12] Zhang, Z., et al. H2O: Heavy-Hitter Oracle for Efficient Generative Inference. NeurIPS, 2023.