pith. machine review for the scientific record.

arxiv: 2605.03562 · v2 · submitted 2026-05-05 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links · Lean Theorem

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

Jorge L. Ruiz Williams

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache quantization · model-visible distortion · score-space correction · HeadQ · attention KL divergence · perplexity reduction · 2-bit quantization · logit correction

The pith

KV-cache quantization should correct errors in the model's score space rather than minimizing storage reconstruction error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard KV-cache quantizers fail because they optimize for storage-space fidelity instead of the distortions that actually affect the model's attention and output. For keys, this means measuring and correcting score error (logit perturbations modulo constants), leading to the HeadQ method, which learns a query basis during calibration and stores a low-rank residual to add back as a logit correction. For values, it uses an $A^2$-weighted token distortion as a surrogate. Across experiments on six models, this model-visible approach outperforms storage-MSE baselines, recovering 84-94% of the perplexity penalty from 2-bit key quantization on WikiText-103 while keeping the same budget.

Core claim

HeadQ is a key quantization technique that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction, based on the premise that persistent cache error should be measured in model-visible coordinates (score error modulo constant shifts for keys). This, combined with an A^2-weighted surrogate for values, substantially reduces the performance degradation from aggressive quantization compared to minimizing raw key or value MSE.

What carries the argument

HeadQ, which stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction to address score error.
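A minimal numeric sketch of this mechanism. Everything here is an illustrative stand-in, not the paper's implementation: the quantizer is a toy rounding grid, the calibration is plain PCA, and decode queries are idealized to lie exactly in the calibration subspace, the case where the side code cancels the model-visible score error completely. In general the correction removes only the residual component visible through the basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_tok = 64, 8, 128            # head dim, residual rank, cached tokens

# Idealized calibration set: queries lie in an r-dimensional subspace, so a
# PCA-style fit recovers the subspace exactly (real queries are at best
# approximately low-rank, and the paper's calibration may differ).
W = rng.normal(size=(r, d))
Q_cal = rng.normal(size=(4096, r)) @ W
_, _, Vt = np.linalg.svd(Q_cal - Q_cal.mean(0), full_matrices=False)
B = Vt[:r].T                        # (d, r) calibration-learned query basis

K = rng.normal(size=(n_tok, d))     # exact keys
K_hat = np.round(K * 2) / 2         # toy quantizer standing in for 2-bit codes
side_code = (K - K_hat) @ B         # (n_tok, r) low-rank residual side code

q = rng.normal(size=(r,)) @ W       # decode query from the same subspace
scores_exact = K @ q
scores_quant = K_hat @ q                              # storage-only reconstruction
scores_headq = scores_quant + side_code @ (B.T @ q)   # additive logit correction

def visible_err(s):
    """Model-visible error: score error modulo constant logit shifts."""
    e = s - scores_exact
    return float(np.linalg.norm(e - e.mean()))

print(visible_err(scores_quant))    # raw quantization score error: large
print(visible_err(scores_headq))    # ~0: the correction cancels the visible part
```

The point of the sketch is only the coordinate change: the residual is stored in score-relevant coordinates (the query basis), so the same bits correct what attention actually reads rather than what storage MSE measures.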

If this is right

  • Score-space and Fisher error metrics predict attention KL divergence better than raw key MSE.
  • HeadQ recovers 84-94% of excess perplexity in 2-bit K-only decode with dense values.
  • HeadQ combined with the A^2 value policy improves all six tested models in the full-KV 2-bit setting.
  • Controls like null-space interventions, query-PCA, and wrong-sign corrections falsify pure storage-MSE approaches.
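The null-space control in the list above exploits a simple invariance: adding a token-shared component to every key shifts each query's logits by a constant, which softmax cancels, so exact attention is unaffected, but a non-equivariant quantizer need not cancel it. A numeric sketch, with toy affine quantizers standing in for the paper's actual codecs:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tok, levels = 32, 512, 4        # 4 grid levels ~ a 2-bit quantizer
K = rng.normal(size=(n_tok, d))      # keys
q = rng.normal(size=(d,))            # a single query

def attn(K):
    s = K @ q                        # logits
    p = np.exp(s - s.max())
    return p / p.sum()

def quant_per_tensor(K):             # one affine grid for the whole tensor
    lo, hi = K.min(), K.max()
    step = (hi - lo) / (levels - 1)
    return np.round((K - lo) / step) * step + lo

def quant_per_channel(K):            # affine control: one grid per channel
    lo, hi = K.min(axis=0), K.max(axis=0)
    step = (hi - lo) / (levels - 1)
    return np.round((K - lo) / step) * step + lo

def kl(p, r):
    return float(np.sum(p * np.log(p / r)))

# Exact attention is invariant to a token-shared key component: adding c to
# every key shifts every logit by q @ c, which the softmax cancels.
c = 5.0 * rng.normal(size=(d,))
assert np.allclose(attn(K), attn(K + c))

# A per-tensor grid is not equivariant: the common component eats dynamic
# range, inflating the step size and hence the model-visible damage.
base = kl(attn(K), attn(quant_per_tensor(K)))
shifted = kl(attn(K), attn(quant_per_tensor(K + c)))
print(base, shifted)                 # the shift typically inflates the damage

# The per-channel affine control shifts its grid along with each channel,
# so its quantization error, and hence its damage, is unchanged.
print(kl(attn(K), attn(quant_per_channel(K))),
      kl(attn(K), attn(quant_per_channel(K + c))))
```

This is the shape of the falsification: a storage-MSE objective cannot distinguish the attention-inert common component from damaging error, while a score-space objective can.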

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The localization of main anomalies to low-entropy route-flip boundaries in small models could motivate entropy-aware or input-dependent bit allocation schemes.
  • If the calibration basis holds, the same visible-distortion principle might extend to compressing other attention components such as position encodings.
  • One could test whether score-space corrections remain effective in longer contexts or non-Pythia architectures beyond the six models studied.

Load-bearing premise

The calibration-learned query basis and low-rank residual side code generalize reliably across inputs, models, and tasks without retraining or significant degradation.

What would settle it

Observing that HeadQ fails to reduce perplexity or increases score error on a held-out model or task not used in calibration would falsify the generalization of the learned basis.

Figures

Figures reproduced from arXiv: 2605.03562 by Jorge L. Ruiz Williams.

Figure 1. Headline key-side payoff on the damage-bearing 2-bit K-only rows. view at source ↗

Figure 2. Matched low-entropy Pythia probes. The 160M anomaly is tied to an … view at source ↗

Figure 3. Softmax-null mean-key intervention at α = 1. Restoring an attention-inert common key component sharply increases Fisher damage for non-equivariant fixed-grid K quantizers, while the per-channel affine control remains invariant. view at source ↗

Figure 4. Full-KV auxiliary composition on the damage-bearing 2-bit rows; … view at source ↗

Figure 4. Full-KV quantizer-corrector codec composition on the damage-bearing … view at source ↗
read the original abstract

KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is score error modulo constant shifts; this yields HeadQ, a key-side method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout gives an $A^2$-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL far better than raw key MSE; same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ falsify storage-MSE alternatives. Matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary. In K-only WikiText-103 decode experiments with dense values, HeadQ removes roughly $84$--$94\%$ of the excess perplexity on the strongest 2-bit rows; in an auxiliary full-KV 2-bit composition, HeadQ plus an $A^2$ value policy improves all six models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes HeadQ, a KV-cache quantization method that measures persistent cache error in model-visible coordinates rather than storage-space reconstruction. For keys, score error is considered modulo constant shifts, leading to storage of a low-rank residual side code in a calibration-learned query basis that is applied as an additive logit correction. For values, an A²-weighted token-distortion surrogate is introduced. Controls including Fisher/score-space versus MSE comparisons, null-space interventions, query-PCA baselines, and wrong-sign ablations are used to show that score-space metrics better predict attention KL divergence. On six models with K-only 2-bit WikiText-103 decoding (dense values), HeadQ recovers 84-94% of excess perplexity; an auxiliary full-KV 2-bit setting with an A² value policy yields improvements across all models.
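The A²-weighted surrogate summarized above can be sketched numerically. The shapes and error models below are illustrative assumptions; the sketch shows only the surrogate's structure: it is the diagonal (cross-token-free) part of the fixed-attention readout error, and it is exact when per-token value errors are mutually orthogonal.

```python
import numpy as np

rng = np.random.default_rng(2)
n_q, n_tok, d = 8, 16, 16
A = rng.dirichlet(np.ones(n_tok), size=n_q)   # fixed attention weights, rows sum to 1
V = rng.normal(size=(n_tok, d))               # values (only their errors matter below)

# Fixed-attention readout error of quantized values V_hat = V + dV is
# ||A dV||_F^2; the A^2-weighted token surrogate keeps only the diagonal
# part: sum_t (sum_i A[i, t]^2) * ||dV[t]||^2.
w = (A ** 2).sum(axis=0)                      # per-token A^2 weights

def exact(dV):
    return float(np.linalg.norm(A @ dV) ** 2)

def surrogate(dV):
    return float(w @ (dV ** 2).sum(axis=1))

# Mutually orthogonal per-token errors: the surrogate is exact.
dV_orth = np.diag(rng.normal(size=n_tok))     # token t errs only along axis t
assert np.isclose(exact(dV_orth), surrogate(dV_orth))

# Generic correlated errors: cross-token terms are dropped, so the surrogate
# only approximates readout damage, but it still ranks tokens for budgeting.
dV = 0.1 * rng.normal(size=(n_tok, d))
print(exact(dV), surrogate(dV))
```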

Significance. If the central claims hold, the work provides a practical and principled advance for memory-efficient LLM inference by aligning quantization objectives with how attention actually consumes the KV cache. The empirical recovery of most excess perplexity under tight 2-bit budgets is notable, and the suite of ablations strengthens the case against pure MSE-based alternatives. The calibration-learned components, while effective on the reported distribution, introduce a dependence on training data that could limit broader adoption unless generalization is demonstrated.

major comments (3)
  1. [Abstract] Abstract: the central claim that HeadQ removes 84--94% of excess perplexity on the strongest 2-bit rows is load-bearing for the paper's contribution, yet the abstract (and presumably the corresponding experimental table) provides no per-model breakdown, no definition of the exact baseline perplexity, and no indication of whether the calibration tokens used to learn the query basis are disjoint from the WikiText-103 evaluation set.
  2. [Method] Method section (description of query basis and low-rank residual): the query basis and side code are explicitly calibration-learned from data rather than derived from first principles; because the correction is a projection onto this fitted basis, any mismatch between calibration and decode distributions directly undermines the reported gains, yet no analysis of basis stability across sequence lengths or tasks is supplied.
  3. [Experiments] Experiments (six-model results): all reported gains reuse the same WikiText-103 calibration regime; without at least one cross-dataset or cross-task transfer experiment (e.g., on a different corpus or longer contexts), the generalization of the learned query basis remains untested and is therefore a load-bearing assumption for the claim that HeadQ is a general-purpose technique.
minor comments (2)
  1. [Abstract] Abstract: the term 'HeadQ side code' appears without an inline definition or pointer to the equation that defines it; a single-sentence gloss would improve readability for readers unfamiliar with the method.
  2. [Abstract] Abstract: the phrase 'matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary' is opaque without a preceding definition of the anomaly or the route-flip phenomenon; a brief parenthetical would clarify the intended meaning.
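One way to make the basis-stability worry in major comment 2 concrete is to measure how much decode-query energy the calibration basis captures: the correction can only act on the projected component. Everything below, the shapes, the PCA-style fit, and the synthetic distribution shift, is an illustrative assumption, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 8

# Calibration queries with low-rank structure plus noise.
W = rng.normal(size=(r, d))
Q_cal = rng.normal(size=(2048, r)) @ W + 0.1 * rng.normal(size=(2048, d))
_, _, Vt = np.linalg.svd(Q_cal - Q_cal.mean(0), full_matrices=False)
B = Vt[:r].T                                   # calibration-learned basis

def captured_energy(Q):
    """Fraction of query energy the basis can see (and hence correct)."""
    return float(np.linalg.norm(Q @ B) ** 2 / np.linalg.norm(Q) ** 2)

# In-distribution decode queries vs. a synthetic distribution shift.
Q_in = rng.normal(size=(512, r)) @ W + 0.1 * rng.normal(size=(512, d))
Q_out = rng.normal(size=(512, d))              # isotropic: worst-case mismatch
print(captured_energy(Q_in))                   # close to 1: correction survives
print(captured_energy(Q_out))                  # near r/d: correction loses reach
```

Reporting this captured-energy fraction across held-out corpora and sequence lengths would directly address the referee's stability question.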

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications drawn directly from the manuscript and indicate where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that HeadQ removes 84--94% of excess perplexity on the strongest 2-bit rows is load-bearing for the paper's contribution, yet the abstract (and presumably the corresponding experimental table) provides no per-model breakdown, no definition of the exact baseline perplexity, and no indication of whether the calibration tokens used to learn the query basis are disjoint from the WikiText-103 evaluation set.

    Authors: The experimental table reports per-model perplexities for all six models, from which the 84--94% recovery range is computed as the min-max across those rows. The baseline is the perplexity obtained under standard 2-bit key quantization (MSE-optimal) with dense values. Calibration tokens are drawn from a held-out portion of the pre-training distribution and do not overlap with the WikiText-103 test split used for evaluation. We will revise the abstract to state these definitions explicitly and include a short per-model summary sentence. revision: yes
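For concreteness, the recovery statistic described in this response has the following form; the 84-94% range is then the min-max of this quantity across the six per-model rows. The numbers below are invented for illustration and are not the paper's values.

```python
# Hypothetical per-model perplexities (illustrative only):
ppl_fp    = 20.0   # dense (unquantized) baseline
ppl_quant = 30.0   # MSE-optimal 2-bit key quantization, dense values
ppl_headq = 21.2   # same bit budget with the HeadQ correction

excess    = ppl_quant - ppl_fp             # penalty introduced by quantization
recovered = (ppl_quant - ppl_headq) / excess
print(f"{100 * recovered:.0f}% of excess perplexity removed")   # 88%
```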

  2. Referee: [Method] Method section (description of query basis and low-rank residual): the query basis and side code are explicitly calibration-learned from data rather than derived from first principles; because the correction is a projection onto this fitted basis, any mismatch between calibration and decode distributions directly undermines the reported gains, yet no analysis of basis stability across sequence lengths or tasks is supplied.

    Authors: A data-driven basis is required because no closed-form derivation of the query directions that minimize score error exists for these models. The low-rank side code captures persistent, low-entropy error components observed in calibration; the method section already notes that calibration sequences are length-matched to evaluation. We agree that explicit stability analysis is absent and will add a short subsection reporting the effect of varying calibration sequence length (up to 2k tokens) on the learned basis and downstream perplexity. revision: partial

  3. Referee: [Experiments] Experiments (six-model results): all reported gains reuse the same WikiText-103 calibration regime; without at least one cross-dataset or cross-task transfer experiment (e.g., on a different corpus or longer contexts), the generalization of the learned query basis remains untested and is therefore a load-bearing assumption for the claim that HeadQ is a general-purpose technique.

    Authors: The six-model suite already spans different scales and families under a fixed calibration protocol, and the core score-space argument is independent of the particular corpus. We acknowledge that transfer to new tasks or longer contexts is not demonstrated and will expand the limitations paragraph to state this explicitly while outlining planned follow-up experiments. revision: partial

standing simulated objections not resolved
  • Empirical demonstration of query-basis transfer to entirely new corpora or substantially longer contexts.

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent validation

full rationale

The paper's core argument defines model-visible score-space error for keys and introduces HeadQ as a low-rank correction in a calibration-learned query basis. This is a design choice justified by the visible-error premise, not a self-referential definition. Experiments report perplexity recovery on WikiText-103 decode using the fitted basis, but include explicit falsification controls (null-space interventions, query-PCA, wrong-sign HeadQ) that test whether gains reduce to the calibration fit itself. No equation or claim equates the reported improvement to the calibration inputs by construction. The method is tested against storage-MSE baselines and across six models; calibration is a standard parameter-fitting step rather than a load-bearing self-citation or renamed known result. The derivation chain remains self-contained against the provided empirical benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on a domain assumption about visible error coordinates plus data-dependent calibration parameters; no new physical entities are postulated.

free parameters (2)
  • query basis
    Learned via calibration (likely PCA or similar) to define the low-rank residual space.
  • low-rank residual side code
    Fitted during calibration to provide the additive logit correction.
axioms (2)
  • domain assumption: Persistent cache error should be measured in model-visible coordinates rather than storage space.
    Stated as the core argument motivating the entire approach.
  • domain assumption: For keys, the visible object is score error modulo constant shifts.
    Used to justify the logit-correction formulation of HeadQ.
invented entities (1)
  • HeadQ side code (no independent evidence)
    purpose: Low-rank residual stored alongside quantized keys for additive logit correction
    New component introduced by the method; no independent falsifiable evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5508 in / 1449 out tokens · 40150 ms · 2026-05-08T18:28:28.665818+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

26 extracted references · 19 canonical work pages · 8 internal anchors

  1. [1] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
  2. [2] Yang, A., Baheer, B., Chen, J., et al. (2024). Qwen2 technical report. arXiv preprint arXiv:2407.10671.
  3. [3] Hui, B., Yang, J., Cui, Z., et al. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  4. [4] Qwen Team. (2024). Introducing Qwen1.5. Retrieved from https://qwenlm.github.io/blog/qwen1.5/
  5. [5] Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
  6. [6] Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., ... & Raff, E. (2023). Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (pp. 2397–2430). PMLR.
  7. [7] Zhang, P., Zeng, G., Wang, T., & Lu, W. (2024). TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385.
  8. [8] Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  9. [9] Liu, Z., Yuan, B., Yin, H., Dong, P., Qin, H., & Yan, J. (2024). KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
  10. [10] Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., & Gholami, A. (2024). KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079.
  11. [11] Kang, H., Zhang, Y., Dong, P., ... & Yan, J. (2024). GEAR: An efficient KV cache compression recipe for near-lossless generative inference of large language models. arXiv preprint arXiv:2403.05527.
  12. [12] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., ... & Re, C. (2023). H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36.
  13. [13] Ashkboos, S., Ilharco, G., Ahn, M., ... & Hooker, S. (2024). QuaRot: Outlier-free 4-bit inference in rotated LLMs. arXiv preprint arXiv:2404.00456.
  14. [14] Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2025). TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874.
  15. [15] Hariri, M., Luo, A., Chen, W., Zhong, S., Zhang, T., Wang, Q., Hu, X., Han, X., Chaudhary, V., et al. (2025). Quantize What Counts: More for Keys, Less for Values. arXiv preprint arXiv:2502.15075.
  16. [16] Tao, Q., Yu, W., & Zhou, J. (2024). AsymKV: Enabling 1-bit quantization of KV cache with layer-wise asymmetric quantization configurations. arXiv preprint arXiv:2410.13212.
  17. [17] Li, X., Xing, Z., Li, Y., Qu, L., Zhen, H.-L., Liu, W., Yao, Y., & Pan, S. J. (2025). KVTuner: Sensitivity-aware layer-wise mixed-precision KV cache quantization for efficient and nearly lossless LLM inference. arXiv preprint arXiv:2502.04420.
  18. [18] Mao, W., Lin, X., Huang, W., Xie, Y., Fu, T., Zhuang, B., Han, S., & Chen, Y. (2026). TriAttention: Efficient long reasoning with trigonometric KV compression. arXiv preprint arXiv:2604.04921.
  19. [19] Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
  20. [20] Emadi, S. M. (2026). Exact attention sensitivity and the geometry of transformer stability. arXiv preprint arXiv:2602.18849.
  21. [21] Nishida, Y. (2026). AXELRAM: Quantize once, never dequantize. arXiv preprint arXiv:2604.02638.
  22. [22] Su, Z., Chen, Z., Shen, W., Wei, H., Li, L., Yu, H., & Yuan, K. (2025). RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations. arXiv preprint arXiv:2501.16383.
  23. [23] Patel, I., & Joshi, I. (2026). PolyKV: A shared asymmetrically-compressed KV cache pool for multi-agent LLM inference. arXiv preprint arXiv:2604.24971.
  24. [24] Karmore, A. (2026). LOOKAT: Lookup-optimized key-attention for memory-efficient transformers. arXiv preprint arXiv:2601.10155.
  25. [25] Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., & Blankevoort, T. (2025). SpinQuant: LLM quantization with learned rotations. International Conference on Learning Representations.
  26. [26] Jegou, H., Douze, M., & Schmid, C. (2010). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117–128.