Recognition: no theorem link
AXELRAM: Quantize Once, Never Dequantize
Pith reviewed 2026-05-13 20:43 UTC · model grok-4.3
The pith
AXELRAM stores KV cache in quantized form and computes attention scores via table lookup without any dequantization step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying an orthogonal transform on write and performing table lookup on read with no inverse transform, attention scores can be computed directly from the quantized KV cache indices; the orthogonal transform concentrates each coordinate's distribution toward N(0, 1/d), so the optimal quantizer is fixed once the dimension d and bit-width b are known and does not depend on the input data.
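A minimal sketch of the claimed asymmetric path, assuming per-coordinate scalar quantization against a fixed codebook; the SRAM macro, the exact codebook construction, and every function name below are illustrative, not taken from the paper's code:

```python
import numpy as np
from scipy.stats import norm

def random_orthogonal(d, seed=0):
    # Random orthogonal matrix from QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def write_path(k, transform, codebook):
    # Write path: rotate the key once, store only per-coordinate codebook indices.
    k_rot = transform @ k
    return np.argmin(np.abs(k_rot[:, None] - codebook[None, :]), axis=1)

def read_path(q, transform, codebook, k_indices):
    # Read path: rotate the query, build a per-query lookup table once, then score
    # the cached key with pure lookups and additions -- no inverse transform.
    q_rot = transform @ q
    lut = q_rot[:, None] * codebook[None, :]            # shape (d, 2**b)
    return lut[np.arange(lut.shape[0]), k_indices].sum()

d, b = 128, 4
transform = random_orthogonal(d)
# Illustrative fixed codebook for N(0, 1/d): quantile midpoints of the target Gaussian,
# a simple stand-in for an optimal Gaussian quantizer.
codebook = norm.ppf((np.arange(2**b) + 0.5) / 2**b, scale=1.0 / np.sqrt(d))

rng = np.random.default_rng(1)
k = rng.standard_normal(d) / np.sqrt(d)
q = rng.standard_normal(d)
idx = write_path(k, transform, codebook)
# Because the transform is orthogonal, (Tq) . (Tk) = q . k, so no inverse is needed.
print("exact:", q @ k, "lookup approx:", read_path(q, transform, codebook, idx))
```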
What carries the argument
Fixed codebook arising from orthogonal-transform quantization, used in an asymmetric path where the transform occurs only on write and read uses direct table lookup without inversion.
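A hedged sketch of how such a design-time codebook could be produced, assuming the post-transform coordinates really follow N(0, 1/d): Lloyd's algorithm run once on samples of that Gaussian, with no model or input data involved. All names are illustrative; this could stand in for the quantile-midpoint codebook in the sketch above.

```python
import numpy as np

def fixed_codebook(d, b, iters=50, n=200_000, seed=0):
    # Lloyd-Max scalar quantizer fitted once to N(0, 1/d): the levels depend
    # only on d and b, never on any model's KV data.
    rng = np.random.default_rng(seed)
    samples = rng.normal(0.0, 1.0 / np.sqrt(d), size=n)
    levels = np.quantile(samples, (np.arange(2**b) + 0.5) / 2**b)  # initialization
    for _ in range(iters):
        assign = np.argmin(np.abs(samples[:, None] - levels[None, :]), axis=1)
        for j in range(2**b):
            if np.any(assign == j):
                levels[j] = samples[assign == j].mean()   # move level to cell centroid
    return np.sort(levels)

# The same table would serve any input once d and b are fixed, per the paper's claim.
print(fixed_codebook(d=128, b=4))
```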
If this is right
- Per-query attention computation requires only index lookups instead of full matrix multiplies after the initial quantization.
- The same codebook works for any input once d and b are fixed, removing the need for per-sequence calibration of the quantizer.
- Sign-pattern selection using 200 candidates and 8 calibration samples removes catastrophic perplexity spikes with no extra hardware.
- The architecture applies uniformly to LLaMA-3.1-8B and similar stable models after the one-time sign fix.
Where Pith is reading between the lines
- The same fixed-codebook idea could be tested on other attention variants such as grouped-query attention without changing the SRAM macro.
- If the concentration property holds for higher bit widths, the design might extend to mixed-precision KV caches with the same lookup table.
- Layer-wise norm heterogeneity identified as the root of sign sensitivity suggests a simple norm-based pre-check could further reduce the 200-candidate search.
Load-bearing premise
Orthogonal transforms always concentrate the per-coordinate distribution tightly enough around N(0, 1/d) that the best quantizer becomes independent of the particular input sequence.
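This premise can be spot-checked numerically: for any fixed vector x, a random orthogonal transform spreads its energy so that each coordinate is approximately N(0, ||x||^2 / d). A minimal check (it probes the marginal distribution only, not whether the resulting quantizer is optimal for a given model's KV statistics):

```python
import numpy as np

d, trials = 128, 500
rng = np.random.default_rng(0)
x = np.zeros(d)
x[0] = 1.0          # deliberately heterogeneous input: a single outlier coordinate

coords = []
for _ in range(trials):
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix
    coords.append(q @ x)
coords = np.concatenate(coords)

# Each post-transform coordinate should have variance close to ||x||^2 / d = 1/128.
print("empirical variance:", coords.var(), "target:", 1.0 / d)
```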
What would settle it
Measure the actual multiplication count and perplexity when the fixed codebook is applied to a new model and seed; if the multiplication reduction falls below 50x or perplexity spikes exceed 50 without the sign-selection step, the central claim does not hold.
Figures
Original abstract
We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design -- transform on write, table-lookup on read with no inverse transform -- reduces per-query multiplications by 102.4x (a mathematical identity). Through multi-seed evaluation (10 seeds x 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Delta > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AXELRAM, an SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. It relies on orthogonal-transform-based quantization to produce a design-time fixed codebook depending only on dimension d and bit-width b. The asymmetric path (transform on write, table-lookup on read) is claimed to reduce per-query multiplications by 102.4x as a mathematical identity. Multi-seed experiments across models reveal sign-pattern sensitivity causing large PPL spikes on some models (e.g., Qwen2.5-3B), which is mitigated by a one-time gradient-free search over 200 sign patterns using 8 calibration samples.
Significance. If the multiplication reduction holds under hardware mapping and the calibration fix proves robust, the approach could enable substantial efficiency gains in LLM inference by eliminating dequantization overhead in KV cache access. The identification of sign-pattern sensitivity in the KV domain and its link to layer-wise norm heterogeneity extends prior observations from weight quantization and provides a low-overhead practical remedy.
major comments (2)
- [Abstract] The central claim that 'the optimal quantizer depends only on dimension d and bit-width b, not on input data' because orthogonal transforms concentrate coordinates to N(0,1/d) is directly contradicted by the reported sign-pattern sensitivity. The manuscript shows catastrophic PPL spikes (Delta > 50) on Qwen2.5-3B due to layer-wise norm heterogeneity, requiring a model-specific search over 200 candidates with 8 calibration samples. This indicates the effective post-transform distribution remains dependent on model statistics, undermining the premise of a fully data-independent fixed codebook.
- [Abstract] The 102.4x multiplication reduction is asserted as a mathematical identity arising from the asymmetric path design, yet the manuscript provides no explicit derivation, error-propagation analysis, or hardware-mapping details to confirm that table lookup fully eliminates multiplications even under the observed norm heterogeneity and sign-pattern effects.
minor comments (1)
- [Evaluation] The multi-seed protocol (10 seeds x 3 models) is a strength and clearly demonstrates the sign-sensitivity phenomenon, but the manuscript should clarify how the 8 calibration samples were selected and whether they generalize across layers with varying norms.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, clarifying the distinction between the data-independent quantizer and the transform calibration while agreeing to strengthen the manuscript with additional derivations.
Point-by-point responses
- Referee: [Abstract] The central claim that 'the optimal quantizer depends only on dimension d and bit-width b, not on input data' because orthogonal transforms concentrate coordinates to N(0,1/d) is directly contradicted by the reported sign-pattern sensitivity. The manuscript shows catastrophic PPL spikes (Delta > 50) on Qwen2.5-3B due to layer-wise norm heterogeneity, requiring a model-specific search over 200 candidates with 8 calibration samples. This indicates the effective post-transform distribution remains dependent on model statistics, undermining the premise of a fully data-independent fixed codebook.
  Authors: The data-independence claim applies specifically to the quantization codebook (bin boundaries and reconstruction levels), which is fixed by the N(0,1/d) concentration property and does not require input statistics. The sign-pattern sensitivity is a distinct issue in selecting the orthogonal transform matrix itself, driven by layer-wise norm heterogeneity in the KV cache; it is mitigated by the low-cost gradient-free search described in the manuscript, and the quantizer remains data-independent after calibration. We will revise the abstract and introduction to explicitly separate these two elements and avoid ambiguity. Revision: yes.
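A hedged sketch of what the described one-time, gradient-free search could look like, assuming the 200 candidates are random ±1 column-sign patterns applied to the base orthogonal transform and scored by perplexity on the 8 calibration samples; `evaluate_ppl` and every other name here is a placeholder, not the paper's API:

```python
import numpy as np

def select_sign_pattern(base_transform, evaluate_ppl, calib_samples,
                        n_candidates=200, seed=0):
    """One-time, gradient-free search: try n_candidates random sign patterns and
    keep the one with the lowest perplexity on a few calibration samples."""
    rng = np.random.default_rng(seed)
    d = base_transform.shape[0]
    best_signs, best_ppl = None, np.inf
    for _ in range(n_candidates):
        signs = rng.choice([-1.0, 1.0], size=d)
        candidate = base_transform * signs[None, :]   # flip column signs; still orthogonal
        ppl = evaluate_ppl(candidate, calib_samples)  # e.g. 8 held-out sequences
        if ppl < best_ppl:
            best_signs, best_ppl = signs, ppl
    return best_signs, best_ppl
```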
- Referee: [Abstract] The 102.4x multiplication reduction is asserted as a mathematical identity arising from the asymmetric path design, yet the manuscript provides no explicit derivation, error-propagation analysis, or hardware-mapping details to confirm that table lookup fully eliminates multiplications even under the observed norm heterogeneity and sign-pattern effects.
  Authors: We agree that an explicit derivation is missing and will add one. The reduction is a direct consequence of the asymmetric design: the orthogonal transform is applied once on write, after which attention scores are obtained via table lookup on the fixed codebook indices, replacing the d multiplications of a standard dot product with index-based lookups and additions. Norm heterogeneity is handled by the sign calibration without reintroducing multiplications. We will include a step-by-step mathematical derivation, error bounds, and a high-level hardware-mapping discussion in the revised methods section. Revision: yes.
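A minimal counting sketch of the kind of derivation the authors promise, assuming a per-query scalar-codebook lookup table costs d * 2^b multiplications to build and each cached key is then scored with lookups and additions only; the paper's specific 102.4x figure depends on configuration details (d, b, context length, hardware mapping) not reproduced here.

```python
def mult_counts(d, b, context_len):
    # Standard read path: a length-d dot product per cached key.
    standard = context_len * d
    # Lookup read path: build the per-query table q_i * level_c once
    # (d * 2**b multiplications), then score every key without multiplying.
    lookup_setup = d * (2 ** b)
    return standard, lookup_setup, standard / lookup_setup

# Illustrative numbers only (gives 256x here, not the paper's 102.4x).
print(mult_counts(d=128, b=4, context_len=4096))
```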
Circularity Check
No circularity in derivation chain
Full rationale
The paper presents the 102.4x multiplication reduction explicitly as a mathematical identity arising from the asymmetric design (transform on write, table-lookup on read, no inverse). The distribution concentration claim is stated as a property of orthogonal transforms rather than derived from the paper's own fitted results or outputs. Sign-pattern selection is performed via a separate gradient-free search over calibration samples and is not renamed as a prediction or forced by self-definition. No step equates a claimed result to its inputs by construction, and the central claims remain independent of any self-citation chain or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0, 1/d), independent of input data.
Forward citations
Cited by 2 Pith papers
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization. HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization. HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A²-weighted value policy.
Reference graph
Works this paper leans on
- [1] A. Zandieh, M. Braverman, and A. Karbasi, “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate,” in Proc. ICLR, 2026.
- [2] L. Liu, Z. Hu, Y. Zhu, and C. De Sa, “SpinQuant: LLM Quantization with Learned Rotations,” in Proc. ICLR, 2025.
- [3] Y. Liang, H. Chen, Z. Zhang, S. Han, and Z. Liu, “ParoQuant: Pairwise Rotation Quantization,” in Proc. ICLR, 2026.
- [4] S. Ashkboos et al., “QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs,” in Proc. NeurIPS, 2024.
- [5] J. Chee et al., “QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks,” in Proc. ICML, 2024.
- [6] Z. Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” in Proc. ICML, 2024.
- [7] R. Mao et al., “QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead,” arXiv:2406.03482, 2024.
- [8] A. Karmore, “LOOKAT: Lookup-Optimized Key-Attention for Memory-Efficient Transformers,” arXiv:2601.10155, 2026.
- [9] D. Blalock and J. Guttag, “Multiplying Matrices Without Multiplying,” in Proc. ICML, 2021.
- [10] W. Lu, Y. Wu, and Z. Wang, “Accelerator Architecture For A Transformer Machine Learning Model,” US Patent Application US20250028563A1, 2025.
- [11] Y. Wang et al., “KVLinC: Hadamard Rotation with Linear Correction for KV Cache Quantization,” arXiv:2510.05373, 2025.
- [12] Z. Chen et al., “PolarQuant: Polar-Coordinate KV Cache Quantization,” arXiv:2502.02617, 2025.
- [13] H. Jégou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,” IEEE TPAMI, 2011.