pith. machine review for the scientific record.

arxiv: 2604.02638 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AR

Recognition: no theorem link

AXELRAM: Quantize Once, Never Dequantize

Yasushi Nishida

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AR
keywords KV cache quantization · attention computation · SRAM macro · orthogonal transform · dequantization avoidance · sign pattern calibration · table lookup

The pith

AXELRAM stores KV cache in quantized form and computes attention scores via table lookup without any dequantization step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an SRAM architecture that keeps key-value caches quantized after a single orthogonal transform at write time. Attention scores are then obtained by indexing into a fixed codebook on read, eliminating the need to reconstruct full-precision values. This asymmetric design turns the mathematical structure of the transform into a direct lookup operation. The approach cuts the number of multiplications per query by a factor of 102.4 while using a codebook that depends only on dimension and bit width. A one-time gradient-free calibration step selects sign patterns to prevent large perplexity spikes on sensitive models.

Core claim

By applying an orthogonal transform on write and performing table lookup on read with no inverse transform, attention scores can be computed directly from the quantized KV cache indices; the orthogonal transform concentrates each coordinate to a distribution N(0, 1/d), so the optimal quantizer is fixed once dimension d and bit-width b are known and does not depend on the input data.

What carries the argument

Fixed codebook arising from orthogonal-transform quantization, used in an asymmetric path where the transform occurs only on write and read uses direct table lookup without inversion.
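The asymmetric path can be sketched in a few lines. This is an editorial illustration, not the paper's implementation: the toy dimension, the uniform codebook (standing in for the paper's Lloyd-Max levels), and the helper names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 16, 3                      # toy sizes; the paper uses d = 128, b = 3
levels = 2 ** bits
sigma = 1.0 / np.sqrt(d)             # post-rotation coordinates are ~ N(0, 1/d)

# Design-time fixed codebook: a uniform grid over +/- 3 sigma, an illustrative
# stand-in for the paper's Lloyd-Max levels.
codebook = np.linspace(-3 * sigma, 3 * sigma, levels)

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal transform

def write_key(k):
    """Write path: rotate once, store the norm plus per-coordinate 3-bit indices."""
    n = np.linalg.norm(k)
    k_rot = Q @ (k / n)
    idx = np.abs(k_rot[:, None] - codebook[None, :]).argmin(axis=1)
    return n, idx

def score(q, stored):
    """Read path: per-coordinate table lookup, adds, one norm-scaling multiply.
    No inverse transform and no per-element multiply on the cached key."""
    q_rot = Q @ q                                # once per query
    table = q_rot[:, None] * codebook[None, :]   # d x levels, once per query
    n, idx = stored
    return n * table[np.arange(d), idx].sum()    # lookups + adds + 1 multiply

k, q = rng.standard_normal(d), rng.standard_normal(d)
exact = q @ k
approx = score(q, write_key(k))
print(exact, approx)   # the lookup score approximates the exact dot product
```

Because Q is orthogonal, q·k = (Qq)·(Qk), so the lookup sum reconstructs the score directly from the stored indices without any dequantization step.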

If this is right

  • Per-query attention computation requires only index lookups instead of full matrix multiplies after the initial quantization.
  • The same codebook works for any input once d and b are fixed, removing the need for per-sequence calibration of the quantizer.
  • Sign-pattern selection using 200 candidates and 8 calibration samples removes catastrophic perplexity spikes with no extra hardware.
  • The architecture applies uniformly to LLaMA-3.1-8B and similar stable models after the one-time sign fix.
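The sign-selection bullet can be made concrete with a toy gradient-free search. The candidate and sample counts match the paper; the loss function here (quantization error after a fixed rotation) is an illustrative stand-in for the perplexity measurement the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_candidates, n_calib = 16, 200, 8   # 200 candidates / 8 samples, per the paper

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # fixed orthogonal transform

def calib_loss(signs, samples):
    """Assumed proxy loss: total quantization error of the sign-flipped,
    rotated calibration vectors (the paper scores perplexity instead)."""
    err = 0.0
    for x in samples:
        y = Q @ (signs * x)          # sign pattern applied before the transform
        q = np.round(y * 4) / 4      # crude fixed quantizer, for illustration
        err += float(np.sum((y - q) ** 2))
    return err

samples = [rng.standard_normal(d) for _ in range(n_calib)]
best_signs, best_loss = None, np.inf
for _ in range(n_candidates):        # gradient-free: sample and keep the best
    signs = rng.choice([-1.0, 1.0], size=d)
    loss = calib_loss(signs, samples)
    if loss < best_loss:
        best_signs, best_loss = signs, loss
# best_signs is fixed once and reused for all future inference (one-time cost).
```

The one-time nature of the search is the point: nothing in the read or write path changes, so the fix costs no additional hardware.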

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixed-codebook idea could be tested on other attention variants such as grouped-query attention without changing the SRAM macro.
  • If the concentration property holds for higher bit widths, the design might extend to mixed-precision KV caches with the same lookup table.
  • Layer-wise norm heterogeneity, identified as the root of sign sensitivity, suggests that a simple norm-based pre-check could further shrink the 200-candidate search.

Load-bearing premise

Orthogonal transforms always concentrate the per-coordinate distribution tightly enough around N(0, 1/d) that the best quantizer becomes independent of the particular input sequence.
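This premise is easy to sanity-check numerically for well-behaved inputs: the coordinates of unit-norm vectors under a random orthogonal transform have variance 1/d on average. A quick sketch (the transform choice and sample counts are illustrative; the open question is whether real KV activations, with their heavy tails and norm heterogeneity, behave this well):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 2000
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal transform

x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)      # unit-norm inputs (norm stored separately)
coords = (x @ Q.T).ravel()                         # all post-rotation coordinates

print(coords.var(), 1.0 / d)                       # empirical variance vs. 1/d
```

Each rotated unit vector's squared coordinates sum to exactly 1, so the pooled variance lands on 1/d by construction; the load-bearing question is whether the shape of the distribution is close enough to Gaussian that one fixed Lloyd-Max quantizer is near-optimal for every input.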

What would settle it

Measure the actual multiplication count and perplexity when the fixed codebook is applied to a new model and seed; if the multiplication reduction falls below 50x or perplexity spikes exceed 50 without the sign-selection step, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2604.02638 by Yasushi Nishida.

Figure 1: Quantize Once, Never Dequantize. Conventional rotation-based … [figure image not reproduced]
Figure 2: AXELRAM smart SRAM macro. Write path (left): norm extraction, FWHT butterfly network (448 add/sub, zero multipliers), Lloyd-Max comparator quantization (896 comparators), writing 3-bit indices + FP16 norm to SRAM. Read path (right): pre-computed table lookup (128 parallel reads), adder tree (127 adders), norm scaling (1 multiplication). Pre-computation (top center, once per query): table generation with 10… [figure image not reproduced]
Figure 3: Read path detail. Phase 1 (once per query): FWHT rotation of query, table generation (… [figure image not reproduced]
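Figure 2's write path is built around a fast Walsh-Hadamard transform, which needs only additions and subtractions: (d/2)·log₂ d = 448 butterflies for d = 128, matching the figure. A reference software sketch of the normalized (orthonormal) FWHT, an editorial illustration rather than the paper's hardware description:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (Sylvester ordering), normalized so the
    transform is orthonormal. The butterfly stages use adds/subs only; the
    single scaling at the end is the one place a multiply appears in software."""
    x = np.asarray(x, dtype=float).copy()
    d = len(x)
    assert d & (d - 1) == 0, "length must be a power of two"
    h = 1
    while h < d:                       # log2(d) passes of d/2 butterflies each
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly: add/sub only
        h *= 2
    return x / np.sqrt(d)              # orthonormal scaling

v = np.arange(8, dtype=float)
w = fwht(v)
print(np.allclose(np.linalg.norm(w), np.linalg.norm(v)))  # True: norm preserved
```

Because the normalized Hadamard matrix is symmetric and orthogonal, the transform is its own inverse; the paper's design exploits the forward direction only, on the write path.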
original abstract

We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design -- transform on write, table-lookup on read with no inverse transform -- reduces per-query multiplications by 102.4x (a mathematical identity). Through multi-seed evaluation (10 seeds x 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Delta > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AXELRAM, an SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. It relies on orthogonal-transform-based quantization to produce a design-time fixed codebook depending only on dimension d and bit-width b. The asymmetric path (transform on write, table-lookup on read) is claimed to reduce per-query multiplications by 102.4x as a mathematical identity. Multi-seed experiments across models reveal sign-pattern sensitivity causing large PPL spikes on some models (e.g., Qwen2.5-3B), which is mitigated by a one-time gradient-free search over 200 sign patterns using 8 calibration samples.

Significance. If the multiplication reduction holds under hardware mapping and the calibration fix proves robust, the approach could enable substantial efficiency gains in LLM inference by eliminating dequantization overhead in KV cache access. The identification of sign-pattern sensitivity in the KV domain and its link to layer-wise norm heterogeneity extends prior observations from weight quantization and provides a low-overhead practical remedy.

major comments (2)
  1. [Abstract] The central claim that 'the optimal quantizer depends only on dimension d and bit-width b, not on input data' because orthogonal transforms concentrate coordinates to N(0,1/d) is directly contradicted by the reported sign-pattern sensitivity. The manuscript shows catastrophic PPL spikes (Delta > 50) on Qwen2.5-3B due to layer-wise norm heterogeneity, requiring a model-specific search over 200 candidates with 8 calibration samples. This indicates the effective post-transform distribution remains dependent on model statistics, undermining the premise of a fully data-independent fixed codebook.
  2. [Abstract] The 102.4x multiplication reduction is asserted as a mathematical identity from the asymmetric path design, yet the manuscript provides no explicit derivation, error-propagation analysis, or hardware-mapping details to confirm that table-lookup fully eliminates multiplications even under the observed norm heterogeneity and sign-pattern effects.
minor comments (1)
  1. [Evaluation] The multi-seed protocol (10 seeds x 3 models) is a strength in demonstrating the sign-sensitivity phenomenon, but the manuscript should clarify how the 8 calibration samples were selected and whether they generalize across layers with varying norms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the distinction between the data-independent quantizer and the transform calibration while agreeing to strengthen the manuscript with additional derivations.

point-by-point responses
  1. Referee: [Abstract] The central claim that 'the optimal quantizer depends only on dimension d and bit-width b, not on input data' because orthogonal transforms concentrate coordinates to N(0,1/d) is directly contradicted by the reported sign-pattern sensitivity. The manuscript shows catastrophic PPL spikes (Delta > 50) on Qwen2.5-3B due to layer-wise norm heterogeneity, requiring a model-specific search over 200 candidates with 8 calibration samples. This indicates the effective post-transform distribution remains dependent on model statistics, undermining the premise of a fully data-independent fixed codebook.

    Authors: The data-independence claim applies specifically to the quantization codebook (bin boundaries and reconstruction levels), which are fixed from the N(0,1/d) concentration property and do not require input statistics. The sign-pattern sensitivity is a distinct issue in selecting the orthogonal transform matrix itself, driven by layer-wise norm heterogeneity in the KV cache; this is mitigated by the low-cost gradient-free search described in the manuscript. The quantizer remains data-independent even after calibration. We will revise the abstract and introduction to explicitly separate these two elements and avoid ambiguity. revision: yes

  2. Referee: [Abstract] The 102.4x multiplication reduction is asserted as a mathematical identity from the asymmetric path design, yet the manuscript provides no explicit derivation, error-propagation analysis, or hardware-mapping details to confirm that table-lookup fully eliminates multiplications even under the observed norm heterogeneity and sign-pattern effects.

    Authors: We agree that an explicit derivation is missing and will add it. The reduction is a direct consequence of the asymmetric design: the orthogonal transform is applied once on write, after which attention scores are obtained via table lookup on the fixed codebook indices, replacing the d multiplications of a standard dot product with index-based lookups and additions. Norm heterogeneity is handled by the sign calibration without reintroducing multiplications. We will include a step-by-step mathematical derivation, error bounds, and a high-level hardware mapping discussion in the revised methods section. revision: yes
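One accounting consistent with the quoted 102.4x figure (an editorial reconstruction, not the paper's derivation; the context length of 4096 is an assumption chosen because it makes the ratio come out exactly): with d = 128 and a 3-bit (8-level) codebook, the baseline spends 128 multiplications per cached key, while the asymmetric path spends 128 × 8 = 1024 multiplications once per query on table generation plus one norm-scaling multiply per key.

```python
d, levels, seq_len = 128, 8, 4096      # d and levels match Figure 2;
                                       # seq_len is an assumed context length
baseline = d * seq_len                 # one multiply per coordinate per cached key
axelram = d * levels + 1 * seq_len     # table generation once + per-key norm scaling
print(baseline / axelram)              # 102.4
```

Under this accounting the reduction is context-length dependent (it approaches d = 128x as the table-generation cost amortizes over longer contexts), which is exactly the kind of detail the requested derivation should pin down.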

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents the 102.4x multiplication reduction explicitly as a mathematical identity arising from the asymmetric design (transform on write, table-lookup on read, no inverse). The distribution concentration claim is stated as a property of orthogonal transforms rather than derived from the paper's own fitted results or outputs. Sign-pattern selection is performed via a separate gradient-free search over calibration samples and is not renamed as a prediction or forced by self-definition. No step equates a claimed result to its inputs by construction, and the central claims remain independent of any self-citation chain or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that orthogonal transforms produce coordinate-wise N(0,1/d) distributions independent of data, allowing a design-time fixed codebook; no free parameters are fitted to target accuracy, and no new physical entities are postulated.

axioms (1)
  • domain assumption Orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0,1/d) independent of input data
    Invoked to justify that the optimal quantizer depends only on d and b

pith-pipeline@v0.9.0 · 5523 in / 1284 out tokens · 39364 ms · 2026-05-13T20:43:27.646948+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  2. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper

  1. A. Zandieh, M. Braverman, and A. Karbasi, “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate,” in Proc. ICLR, 2026.
  2. L. Liu, Z. Hu, Y. Zhu, and C. De Sa, “SpinQuant: LLM Quantization with Learned Rotations,” in Proc. ICLR, 2025.
  3. Y. Liang, H. Chen, Z. Zhang, S. Han, and Z. Liu, “ParoQuant: Pairwise Rotation Quantization,” in Proc. ICLR, 2026.
  4. S. Ashkboos et al., “QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs,” in Proc. NeurIPS, 2024.
  5. J. Chee et al., “QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks,” in Proc. ICML, 2024.
  6. Z. Liu et al., “KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache,” in Proc. ICML, 2024.
  7. R. Mao et al., “QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead,” arXiv:2406.03482, 2024.
  8. A. Karmore, “LOOKAT: Lookup-Optimized Key-Attention for Memory-Efficient Transformers,” arXiv:2601.10155, 2026.
  9. D. Blalock and J. Guttag, “Multiplying Matrices Without Multiplying,” in Proc. ICML, 2021.
  10. W. Lu, Y. Wu, and Z. Wang, “Accelerator Architecture For A Transformer Machine Learning Model,” US Patent Application US20250028563A1, 2025.
  11. Y. Wang et al., “KVLinC: Hadamard Rotation with Linear Correction for KV Cache Quantization,” arXiv:2510.05373, 2025.
  12. Z. Chen et al., “PolarQuant: Polar-Coordinate KV Cache Quantization,” arXiv:2502.02617, 2025.
  13. H. Jégou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,” IEEE TPAMI, 2011.