pith. machine review for the scientific record.

arxiv: 2605.08114 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.IT · cs.MS · math.IT

Recognition: 2 Lean theorem links

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · cs.MS · math.IT
keywords KV cache quantization · attention KL divergence · Jensen inequality · bit budget · hyperspherical distribution · key-value asymmetry · transformer inference

The pith

Applying milder quantization to keys than to values reduces KL divergence between reference and quantized attention at the 4-bit budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that at the 4-bit budget (n = 4), the KQV scheme outperforms both the scalar MSE baseline and the symmetric QKQV scheme on KL divergence, geometric key error, and 6D distance across the tested distributions and ranks. A reader would care because KV cache quantization enables faster inference in transformer models, and the difference arises from how inner-product variance affects attention scores. By modeling vectors on the hypersphere with a Beta distribution, the work traces how the 1-bit QJL projection applied to keys inflates inner-product variance by a factor of π/2, which softmax then amplifies nonlinearly through Jensen's inequality. The K-V asymmetry holds unconditionally at every budget, while geometric reconstruction quality crosses over depending on the exact bit width.

Core claim

Starting from the Beta distribution on the hypersphere, applying the 1-bit QJL projection to keys inflates inner-product variance by a factor of π/2, which the softmax amplifies superlinearly via Jensen's inequality under a sufficient condition the paper states. This mechanism produces higher KL divergence for symmetric quantization of both keys and values (QKQV) than for the asymmetric KQV scheme at n = 4, where KQV wins on every metric. Empirical results across budgets show a crossover in geometric key error, with QKQV better at n ∈ {2, 3, 5} and KQV better at n ∈ {4, 6}, invariant to rank and tail weight, while KL remains lower for KQV at every budget.
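A minimal Monte Carlo sketch (not the authors' code) of the first step in this chain: under the hyperspherical model, applying a 1-bit sign quantization to JL-style random projections on the key side inflates the variance of inner-product estimates, and for nearly orthogonal unit vectors the inflation factor is close to π/2. All names and parameter values below are illustrative assumptions.

```python
# Assumed setup: unit-norm key/query vectors, Gaussian random projections.
# The sqrt(pi/2) factor unbiases the 1-bit estimator, since
# E[sign(X) Y] = sqrt(2/pi) * Cov(X, Y) for standardized jointly Gaussian X, Y.
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 128, 64, 5_000

k = rng.standard_normal(d); k /= np.linalg.norm(k)   # "key"
q = rng.standard_normal(d); q /= np.linalg.norm(q)   # "query"

qjl_est, exact_est = [], []
for _ in range(trials):
    S = rng.standard_normal((m, d))                  # fresh JL-style projection
    sk, sq = S @ k, S @ q
    qjl_est.append(np.sqrt(np.pi / 2) * np.mean(np.sign(sk) * sq))  # 1-bit key side
    exact_est.append(np.mean(sk * sq))               # key side kept exact

print("true <k, q>             :", round(k @ q, 4))
print("variance, 1-bit key side:", np.var(qjl_est))
print("variance, exact key side:", np.var(exact_est))
print("inflation ratio (close to pi/2 when <k, q> is near 0):",
      np.var(qjl_est) / np.var(exact_est))
```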

What carries the argument

The nonlinear amplification of key inner-product variance through the softmax operation when the 1-bit QJL projection is applied to keys.
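To make the amplification step concrete, here is a hedged toy illustration (an assumed setup, not the paper's sufficient condition or its experiment): injecting zero-mean noise of increasing scale into attention logits and averaging the resulting KL divergence shows that KL grows roughly quadratically in the perturbation scale, so extra variance on the key side is passed through the softmax nonlinearly rather than proportionally.

```python
# Toy demonstration: mean KL between reference and noise-perturbed attention
# rows grows superlinearly (about quadratically) in the logit noise scale.
import numpy as np

rng = np.random.default_rng(1)
seq_len, trials = 256, 2_000

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = rng.standard_normal(seq_len)   # reference attention logits for one query
p_ref = softmax(logits)

for sigma in (0.05, 0.1, 0.2, 0.4):
    kls = []
    for _ in range(trials):
        p_noisy = softmax(logits + sigma * rng.standard_normal(seq_len))
        kls.append(np.sum(p_ref * np.log(p_ref / p_noisy)))
    print(f"logit noise std {sigma:4.2f} -> mean KL {np.mean(kls):.5f}")
```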

If this is right

  • KL divergence on attention scores directly connects key direction error to potential routing corruption and output collapse in the model.
  • The unconditional K-V asymmetry implies that quantizers should default to milder treatment of keys than values at every budget.
  • Geometric reconstruction quality crosses over with budget, so the optimal scheme depends on the target precision rather than being fixed.
  • At the practically dominant n = 4 setting the Jensen mechanism dominates, explaining why applying the QJL projection to keys as well as values harms performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Quantizer designs could allocate different transforms to keys and values by default to respect their distinct roles in attention.
  • The observed rate-distortion crossover points to an optimization problem of choosing per-vector bit allocation rather than uniform budgets.
  • Validating the hyperspherical Beta model against statistics from real model weights would strengthen the statistical predictions for deployment.

Load-bearing premise

The Beta distribution on the hypersphere accurately captures the direction statistics of real key and value vectors from trained transformers.
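A quick way to see what this premise asserts, and how it could be checked against real activations: for a vector drawn uniformly on the unit sphere in R^d, each coordinate x_i satisfies (x_i + 1)/2 ~ Beta((d-1)/2, (d-1)/2), which matches the Beta(63.5, 63.5) curve the figures use at d = 128. The sketch below (an assumption-laden illustration, not the paper's code) verifies the synthetic case; replacing the sampled vectors with normalized keys or values dumped from a trained model would test the premise itself.

```python
# Check that coordinates of uniform unit-sphere vectors follow the mapped Beta law.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, n = 128, 50_000

x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)        # uniform directions on S^{d-1}
coord = (x[:, 0] + 1.0) / 2.0                        # map one coordinate into [0, 1]

ks = stats.kstest(coord, "beta", args=((d - 1) / 2, (d - 1) / 2))
print("KS statistic:", round(ks.statistic, 4), "  p-value:", round(ks.pvalue, 3))
```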

What would settle it

Compute KL divergence on attention probability distributions from an actual trained transformer at n=4 using the KQV scheme versus the QKQV scheme and check whether the predicted elevation for QKQV appears.
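A sketch of how that check could be scored, under loud assumptions: the arrays are queries and keys dumped from one attention layer of a trained model, and the 4-bit KQV and QKQV key reconstructions have already been produced by some implementation of those schemes (not shown here). The quantity computed is the abstract's KL(p_ref || p_quant) on attention rows; the file names and shapes in the usage comments are hypothetical.

```python
import numpy as np

def attention_kl(q, k_ref, k_quant):
    """Mean KL(p_ref || p_quant) over queries, with p = softmax(q k^T / sqrt(d))."""
    d = q.shape[-1]
    def attn_rows(k):
        s = q @ k.T / np.sqrt(d)
        s -= s.max(axis=1, keepdims=True)            # numerical stability
        p = np.exp(s)
        return p / p.sum(axis=1, keepdims=True)
    p_ref, p_quant = attn_rows(k_ref), attn_rows(k_quant)
    return float(np.mean(np.sum(p_ref * np.log(p_ref / p_quant), axis=1)))

# Hypothetical usage (file names and shapes are illustrative):
# q      = np.load("queries.npy")          # (n_queries, d)
# k_ref  = np.load("keys_ref.npy")         # (seq_len, d)
# k_kqv  = np.load("keys_kqv_4bit.npy")    # keys reconstructed under KQV at n = 4
# k_qkqv = np.load("keys_qkqv_4bit.npy")   # keys reconstructed under QKQV at n = 4
# print("KQV  attention KL:", attention_kl(q, k_ref, k_kqv))
# print("QKQV attention KL:", attention_kl(q, k_ref, k_qkqv))
```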

Figures

Figures reproduced from arXiv: 2605.08114 by Paolo D'Alberto.

Figure 1. d = 2, Beta(0.5, 0.5) — arcsine (U-shaped). Top: rotated coordinate histograms (blue) vs. Beta density (red) for gaussian, heavy tail, low rank. Bottom: KL divergence boxplots, MSE 4-bit vs. MSE 3-bit + QJL 1-bit (equal budget). MSE wins across all distributions. Circles denote outliers beyond 1.5 × IQR (≈2.7σ for a Gaussian).
Figure 2. d = 8, Beta(3.5, 3.5) — mild bell. Top: coordinate histograms begin to concentrate near zero; low rank (rank = 1) shows a nearly flat histogram, deviating from Beta — the first sign of joint-structure failure. Bottom: MSE wins across all distributions under equal budget. Circles denote outliers beyond 1.5 × IQR (≈2.7σ for a Gaussian).
Figure 3. Jensen bridge — the K-driven causal chain at budget n = 4, fat-tail regime (ν = 3, log KL axis). Colour: KQV (red •), QKQV (purple ▲), KV (green ■). Left: three separated clouds in (ϵ_dir^K, KL) space. KQV sits leftmost (WHT + n-bit scalar on K minimises both); KV is far right (no WHT, heavy tails ⇒ large K direction error, KL up to 4 nats). Centre and right: all three clouds collapse onto the same monoton…
Figure 4. d = 128, Beta(63.5, 63.5) — deeply concentrated, nearly Gaussian. Top: all three distributions match the Beta curve closely; the subspace structure of low rank is already hidden in the marginal histogram. Bottom: MSE wins across all distributions under equal budget, consistent with the 2π variance penalty of QJL. Circles denote outliers beyond 1.5 × IQR (≈2.7σ for a Gaussian).
Figure 5. Shannon–φ plane comparisons (orange = KQV, green = KV; • K cache, ▲ V cache, ■ T output). Left: the near-Gaussian low-rank regime where the statistical framework is needed to detect the difference. Right: the extreme fat-tail regime where visual inspection suffices. Accompanying table (MW_r, rank = 32): B = 2, +1.000, KQV wins strongly; B = 3, +0.700, KQV wins; B = 4, +0.145, neutral; B = 5, +0.958, KQV wins; B = 6, −0.923, KV wins. At B = 6, KV wins acros…
Figure 6. Same mode (ν = 1.05) at B = 4. With two additional bits, KQV K (orange •) remains stable at ϵ_dir^K = 0.027; KV K retreats from 0.645 to 0.202 but the gap persists. Both V caches move toward the Lloyd-Max line as the bit budget increases. The geometric separation shrinks but the ordering is unchanged: MW_r = +1.000, energy distance 1.004. TurboQuant has no failure mode under fat tails. The search was honest and…
Figure 7. d = 1024, Beta(511.5, 511.5) — indistinguishable from N(0, 1/d). Top: all three distributions produce identical histograms matching the Beta curve — the rotation completely hides the original structure. Bottom: despite the perfect histogram match, low rank KL divergence is catastrophically large (∼10⁰–10¹), exceeding heavy tail by one to two orders of magnitude. heavy tail is the only case where MSE+QJL…
read the original abstract

We analyse three KV cache quantization schemes under a fair bit budget: \textbf{KV} (scalar MSE baseline), \textbf{KQV} (WHT + MSE on $K$; WHT + MSE + QJL on $V$), and \textbf{QKQV} (WHT + MSE + QJL on both). Starting from the Beta distribution on the hypersphere, we trace how QJL on $K$ inflates inner product variance by $\pi/2$, which softmax amplifies nonlinearly via Jensen's inequality, and we present statistical inference and information metrics to highlight practical differences. Three empirical findings emerge. (1)~At $n=4$ (the practically dominant budget), KQV wins on every measure -- KL divergence, geometric $K$ error, and 6D distance -- across all distributions and ranks tested. (2)~The K--V asymmetry is unconditional: QKQV is consistently worse than KQV in KL divergence at every budget and distribution. (3)~A budget-dependent crossover exists: QKQV achieves better geometric $K$ reconstruction at $n \in \{2,3,5\}$, KQV at $n \in \{4,6\}$, invariant to rank and tail weight -- an open rate-distortion problem. $\mathrm{KL}(p_{\mathrm{ref}} \| p_{\mathrm{quant}})$, K-only by construction, bridges K direction error to routing corruption and output collapse. We present a sufficient condition when the Jensen mechanism amplifies superlinearly through the softmax. At $n \in \{2,3,5\}$, QKQV wins geometrically because this assumption does not bind. At $n=4$, elevated K error and KL divergence for QKQV strongly suggest the Jensen mechanism is the operative cause of the crossover, providing a new perspective and explanation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes three KV cache quantization schemes—KV (scalar MSE baseline), KQV (WHT + MSE on K; WHT + MSE + QJL on V), and QKQV (WHT + MSE + QJL on both)—under a fixed bit budget. Using vectors drawn from a Beta distribution on the hypersphere as a model for K and V, the authors derive that applying QJL to K inflates the variance of inner products by a factor of π/2. This inflation is then amplified nonlinearly by the softmax via Jensen's inequality. They provide a sufficient condition for superlinear amplification and report simulation results showing that at n=4 bits, KQV outperforms on KL divergence, geometric K error, and 6D distance across distributions and ranks; QKQV is consistently worse than KQV in KL divergence; and a budget-dependent crossover in geometric K reconstruction occurs, with QKQV better at n=2,3,5 and KQV at n=4,6.

Significance. If the Beta hypersphere model is a faithful representation of real transformer KV cache statistics, the work provides a mechanistic explanation for performance differences between quantization schemes and identifies the Jensen amplification as the cause of the crossover at n=4. It introduces statistical inference and quality measures (KL, geometric error, 6D distance) that link quantization error to potential routing corruption and output collapse. The parameter-free derivation of the π/2 variance factor and the sufficient condition for Jensen amplification are strengths, offering a new perspective on KV quantization design.

major comments (1)
  1. The central explanatory claim—that the budget-dependent crossover at n=4 is caused by the Jensen mechanism amplifying QJL-induced variance—depends on the Beta distribution on the hypersphere accurately modeling the angular statistics of real K and V vectors from trained transformers (see abstract and the section deriving the variance inflation and sufficient condition). Real KV caches typically exhibit low effective rank, K-V correlations, and non-uniform distributions that could alter inner-product variance or prevent the sufficient condition for superlinear softmax amplification from binding. Without empirical validation against actual model activations or a sensitivity analysis, the practical implications for transformer inference remain uncertain.
minor comments (2)
  1. The abstract mentions 'across all distributions and ranks tested' but does not specify the exact distributions, rank values, or number of trials; including these details or error bars would improve reproducibility and allow assessment of the consistency of the crossover.
  2. The term 'n=4' for bit budget is used without initial definition; clarifying that n refers to the number of bits per element in the quantization would aid readers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the importance of model fidelity. We address the major comment below by clarifying the scope of our theoretical analysis and offering a targeted revision.

read point-by-point responses
  1. Referee: The central explanatory claim—that the budget-dependent crossover at n=4 is caused by the Jensen mechanism amplifying QJL-induced variance—depends on the Beta distribution on the hypersphere accurately modeling the angular statistics of real K and V vectors from trained transformers (see abstract and the section deriving the variance inflation and sufficient condition). Real KV caches typically exhibit low effective rank, K-V correlations, and non-uniform distributions that could alter inner-product variance or prevent the sufficient condition for superlinear softmax amplification from binding. Without empirical validation against actual model activations or a sensitivity analysis, the practical implications for transformer inference remain uncertain.

    Authors: We agree that direct empirical validation on real transformer KV activations would strengthen the practical implications. Our manuscript is explicitly a model-based theoretical study: it begins from the Beta distribution on the hypersphere precisely because this distribution permits an analytic derivation of the π/2 variance inflation factor for inner products under QJL and yields a sufficient condition for superlinear amplification through the softmax via Jensen's inequality. All reported results (KL divergence, geometric K error, 6D distance, and the n=4 crossover) are obtained under controlled simulations within this framework, which isolates the effect of applying QJL to K versus V. The paper does not assert that the Beta hypersphere model replicates every statistic of real KV caches (e.g., low effective rank or K-V correlations); rather, it demonstrates that, whenever the sufficient condition binds, the Jensen mechanism produces the observed performance gap. In revision we will add an explicit Limitations subsection that (i) states the modeling assumptions, (ii) notes that real caches may modulate the variance inflation or prevent the sufficient condition from binding, and (iii) frames the n=4 crossover as a prediction of the model that can be tested on actual activations. This preserves the mechanistic contribution while acknowledging the referee's concern. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are simulation outcomes on an explicit synthetic model

full rationale

The paper begins with an explicit Beta distribution on the hypersphere as the generative model for K and V vectors. It derives the inner-product variance inflation of exactly π/2 from the interaction of this distribution with the QJL operator, then applies Jensen's inequality (a standard inequality) to obtain the sufficient condition for superlinear softmax amplification. All three headline empirical findings—KQV dominance at n=4 on KL/geometric/6D metrics, unconditional K-V asymmetry, and the budget-dependent crossover—are direct outputs of Monte-Carlo simulations run under these fixed distributional assumptions. No parameter is fitted to the target performance numbers, no result is renamed as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central analysis rests on modeling normalized K and V vectors with a Beta distribution on the hypersphere to derive the effect of QJL on inner products; this is a domain assumption rather than a derived result.

axioms (1)
  • domain assumption: Normalized key and value vectors in attention follow a Beta distribution on the hypersphere.
    This modeling choice enables tracing how QJL quantization inflates inner-product variance by π/2.

pith-pipeline@v0.9.0 · 5644 in / 1496 out tokens · 121168 ms · 2026-05-12T01:17:24.884686+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

  1. [1] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv:2504.19874, 2025.

  2. [2] Amir Zandieh, Majid Daliri, and Insu Han. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead. arXiv:2406.03482, 2024.

  3. [3] Joel Max. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6(1):7–12, 1960.

  4. [4] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

  5. [5] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

  6. [6] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

  7. [7] Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.

  8. [8] Belinda Phipson and Gordon K. Smyth. Permutation p-values should never be zero: Calculating exact p-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9(1), 2010.

  9. [9] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.