Fast-TurboQuant: A Multiplier-Free Online Vector Quantization Approach
Pith reviewed 2026-06-26 14:37 UTC · model grok-4.3
The pith
Fast-TurboQuant replaces dense random rotations with a Rademacher phase inversion and fast Walsh-Hadamard transform to enable multiplier-free vector quantization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fast-TurboQuant substitutes the dense random rotation matrix of TurboQuant with a Rademacher phase inversion followed by a fast Walsh-Hadamard transform. The paper claims that the structured transform satisfies the sub-Gaussian concentration requirements for Lloyd-Max quantization while reducing the arithmetic to additions alone. On DBpedia OpenAI-3 Large embeddings, it reports a 19.7× algorithmic speedup in sequential execution, together with lower mean squared error and higher Recall@10, both of which it attributes to the dimension expansion from FWHT zero-padding.
What carries the argument
Rademacher phase inversion followed by fast Walsh-Hadamard transform (FWHT) as a multiplier-free structured Johnson-Lindenstrauss transform that conditions vector distributions for scalar quantization.
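The transform named above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a natural-order, unnormalized FWHT, zero-padding to the next power of two, and a final 1/sqrt(n) rescale. All function names and the seed are hypothetical.

```python
import math
import random


def next_power_of_two(d):
    """Smallest power of two >= d (the FWHT needs a power-of-two length)."""
    n = 1
    while n < d:
        n *= 2
    return n


def fwht(values):
    """Unnormalized in-place fast Walsh-Hadamard transform.

    Uses only additions and subtractions; len(values) must be a power of two.
    """
    n = len(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = values[i], values[i + h]
                values[i], values[i + h] = a + b, a - b
        h *= 2
    return values


def structured_projection(x, signs):
    """Zero-pad x, apply the sign flips (diagonal D), then the FWHT (H)."""
    n = len(signs)
    padded = list(x) + [0.0] * (n - len(x))
    flipped = [v if s > 0 else -v for v, s in zip(padded, signs)]
    # Assumption: the 1/sqrt(n) rescale is kept here for readability. It is one
    # multiply per coordinate and could instead be folded into the codebook.
    scale = 1.0 / math.sqrt(n)
    return [scale * v for v in fwht(flipped)]


rng = random.Random(0)  # hypothetical seed
x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # a "spiky" 6-dimensional unit vector
signs = [rng.choice((-1, 1)) for _ in range(next_power_of_two(len(x)))]
y = structured_projection(x, signs)  # 8 coordinates, each of magnitude 1/sqrt(8)
```

The one-hot input shows the conditioning effect: all of its energy sits in one coordinate before the transform and is spread evenly across the eight padded coordinates afterwards, with the Euclidean norm unchanged.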
Load-bearing premise
The sub-Gaussian concentration properties of the Rademacher-plus-FWHT transform are sufficient to replace the dense random rotation matrix while preserving downstream quantization error and recall performance.
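To make the Lloyd-Max side of this premise concrete, the sketch below fits a scalar codebook with Lloyd's alternating updates on synthetic standard-normal samples. The sample distribution, sample size, seed, and function names are assumptions for illustration; the page does not give the paper's codebooks. For a standard normal input and two levels, the known optimum is centroids at ±sqrt(2/π) with MSE 1 − 2/π.

```python
import random


def lloyd_max(samples, levels, iterations=30):
    """Fit a scalar Lloyd-Max codebook to samples via Lloyd's alternating updates."""
    ordered = sorted(samples)
    # Initialize centroids at evenly spaced sample quantiles.
    centroids = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)]
                 for k in range(levels)]
    for _ in range(iterations):
        # Decision thresholds sit midway between adjacent centroids.
        thresholds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        for v in samples:
            cell = sum(v > t for t in thresholds)
            sums[cell] += v
            counts[cell] += 1
        # Each centroid moves to the mean of its cell (empty cells stay put).
        centroids = [s / c if c else old
                     for s, c, old in zip(sums, counts, centroids)]
    return centroids


def quantization_mse(samples, centroids):
    """Mean squared error when each sample maps to its nearest centroid."""
    return sum(min((v - c) ** 2 for c in centroids) for v in samples) / len(samples)


rng = random.Random(1)  # hypothetical seed
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
centroids = lloyd_max(samples, levels=2)  # one bit per coordinate
mse = quantization_mse(samples, centroids)
```

If the coordinates produced by the structured transform deviate from the distribution the codebook was designed for, the fitted centroids and the achieved MSE shift accordingly, which is why the premise carries so much weight.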
What would settle it
If the mean squared quantization error or Recall@10 on a new set of embeddings is higher with Fast-TurboQuant than with the original dense-matrix TurboQuant, the performance-preservation claim would be falsified.
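The two quantities in this test can be written down directly. The sketch below assumes a per-coordinate MSE and a Recall@k defined as the overlap between the exact and approximate top-k neighbour lists. The page does not state the paper's exact evaluation protocol, so both definitions and the function names are assumptions.

```python
def mean_squared_error(original, reconstructed):
    """Average squared difference between a vector and its dequantized version."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the exact top-k neighbours recovered in the approximate top-k."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k
```

Running both functions on the same embeddings and queries, once with the dense rotation and once with the structured transform, gives the head-to-head comparison the test calls for.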
Original abstract
As large language models scale, memory bandwidth for key-value caches and retrieval-augmented generation systems becomes a critical bottleneck. While 1-bit quantization addresses this constraint, recent TurboQuant relies on dense random rotation matrices to condition the vector distribution before quantization. This projection demands millions of floating-point multiplications per embedding, making it difficult to deploy on constrained edge silicon. We introduce Fast-TurboQuant, a multiplier-free projection architecture that replaces the dense matrix with a structured fast Johnson-Lindenstrauss transform. By applying a Rademacher phase inversion followed by a fast Walsh-Hadamard transform (FWHT), the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections. This substitution reduces the arithmetic complexity to only additions, eliminating hardware multipliers. Evaluation on DBpedia OpenAI-3 Large embeddings demonstrates a 19.7 times algorithmic speedup under sequential execution. Furthermore, the dimension expansion due to the FWHT zero-padding reduces the mean squared error and improves Recall@10.
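A rough operation count illustrates the abstract's "millions of floating-point multiplications" remark. The sketch assumes an input dimension of 1536 (taken from the name of the dataset in the reference list) and padding to the next power of two. The page does not state the padded length, so that part is an assumption, and the function names are hypothetical.

```python
def padded_length(d):
    """Next power of two >= d, the length an FWHT would operate on."""
    return 1 << (d - 1).bit_length()


def dense_multiplies(d):
    """A dense d x d matrix-vector product needs d * d scalar multiplications."""
    return d * d


def fwht_add_sub_count(d):
    """An FWHT of length n performs n additions/subtractions in each of log2(n) stages."""
    n = padded_length(d)
    stages = n.bit_length() - 1
    return n * stages


d = 1536  # assumed embedding dimension
print(padded_length(d), dense_multiplies(d), fwht_add_sub_count(d))
# prints: 2048 2359296 22528
```

These are raw operation counts only. The 19.7× figure in the abstract is a measured sequential-execution speedup, and nothing here should be read as a derivation of it.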
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fast-TurboQuant, which replaces the dense random rotation matrix of TurboQuant with a multiplier-free structured projection consisting of Rademacher sign flips followed by a fast Walsh-Hadamard transform (FWHT) with zero-padding. It claims that this transform satisfies the sub-Gaussian concentration prerequisites for scalar Lloyd-Max quantization, yielding only additions, a 19.7× algorithmic speedup on sequential execution, lower MSE, and higher Recall@10 on DBpedia OpenAI-3 Large embeddings due to the dimension expansion.
Significance. If the sub-Gaussian tail properties and isotropy of the structured transform are shown to match those of the dense random orthogonal matrix sufficiently for Lloyd-Max error bounds, the multiplier-free design would be a practical advance for deploying vector quantization on edge hardware without floating-point multipliers. The explicit reduction of arithmetic to additions and the reported recall improvement via zero-padding are strengths, but the absence of any tail-bound verification or ablation against the dense baseline limits the current impact.
major comments (2)
- [Abstract] Abstract: The central claim that 'the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections' is load-bearing for all performance attributions, yet the manuscript provides neither a derivation of the coordinate-wise moment conditions after FWHT nor any empirical comparison (e.g., kurtosis, tail quantiles, or per-coordinate quantization MSE) against the dense random rotation of the original TurboQuant.
- [Evaluation] Evaluation section (implied by the DBpedia results): The reported 19.7× speedup and Recall@10 gains are presented without error bars, ablation tables isolating the effect of the structured transform versus dimension expansion, or direct head-to-head quantization-error measurements on identical embeddings before and after the Rademacher+FWHT step; this prevents confirmation that the downstream metrics are attributable to the multiplier-free projection rather than the padding.
minor comments (1)
- [Abstract] Notation for the FWHT zero-padding factor and the exact dimension after expansion is not defined in the abstract or early sections, making it difficult to reproduce the claimed MSE reduction.
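The tail diagnostics requested in the first major comment (kurtosis and tail quantiles) can be sketched as follows. The function names and the choice of the 0.999 quantile are hypothetical; the intended use is to call both functions on the coordinates produced by the dense rotation and by the structured transform, then compare.

```python
def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for Gaussian data, larger for heavier tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def tail_quantile(values, q=0.999):
    """Empirical q-quantile of |value|, a crude summary of tail heaviness."""
    ordered = sorted(abs(v) for v in values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]
```

Similar values for the two transforms would support the sub-Gaussian premise empirically. A markedly higher kurtosis or tail quantile for the structured transform would indicate that the Lloyd-Max codebook is being applied to a distribution it was not designed for.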
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We appreciate the feedback highlighting areas where additional theoretical and empirical support would strengthen the manuscript. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that 'the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections' is load-bearing for all performance attributions, yet the manuscript provides neither a derivation of the coordinate-wise moment conditions after FWHT nor any empirical comparison (e.g., kurtosis, tail quantiles, or per-coordinate quantization MSE) against the dense random rotation of the original TurboQuant.
  Authors: We concur that an explicit derivation of the coordinate-wise moment conditions and tail bounds for the Rademacher sign flips followed by FWHT would provide stronger justification for the sub-Gaussian concentration claim. While the structured fast Johnson-Lindenstrauss transform is established in the literature to preserve sub-Gaussian properties, we will add a dedicated subsection deriving these properties specifically for our setting. We will also include empirical plots comparing kurtosis, tail quantiles, and per-coordinate quantization MSE between the structured transform and the dense random rotation on the DBpedia embeddings. These additions will be incorporated in the revised manuscript.
  Revision: yes
- Referee: [Evaluation] Evaluation section (implied by the DBpedia results): The reported 19.7× speedup and Recall@10 gains are presented without error bars, ablation tables isolating the effect of the structured transform versus dimension expansion, or direct head-to-head quantization-error measurements on identical embeddings before and after the Rademacher+FWHT step; this prevents confirmation that the downstream metrics are attributable to the multiplier-free projection rather than the padding.
  Authors: We agree that the current presentation lacks error bars and ablations, which limits the ability to attribute improvements precisely. The speedup figure is from sequential execution timing, and the recall improvement is linked to dimension expansion via zero-padding as noted. To address this, we will add error bars based on multiple runs for the timing and Recall@10 metrics. We will also include an ablation study with tables showing quantization MSE and recall for: (1) original TurboQuant dense projection, (2) our structured projection without padding, (3) with padding. This will isolate the contributions of the multiplier-free transform versus the dimension increase. These revisions will be made to the Evaluation section.
  Revision: yes
Circularity Check
No significant circularity; derivation relies on external transform properties and empirical validation
Full rationale
The paper proposes replacing dense random rotations with a Rademacher sign-flip plus FWHT structured transform, invoking sub-Gaussian concentration to justify compatibility with Lloyd-Max quantization, then reports empirical speedups and recall gains on DBpedia embeddings. No equations define a fitted parameter that is subsequently renamed as a prediction, no self-citation chain bears the central claim, and no result reduces by construction to quantities defined inside the paper. The speedup (additions only) and MSE improvement (from zero-padding) are measured outcomes on held-out data, not tautological restatements of the method's definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Rademacher random variables combined with the fast Walsh-Hadamard transform produce sub-Gaussian vectors whose concentration is adequate for Lloyd-Max scalar quantization.
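One standard route to a per-coordinate sub-Gaussian bound of the kind this axiom relies on is Hoeffding's inequality. The derivation below is a sketch under stated assumptions, not necessarily the authors' argument: $x \in \mathbb{R}^d$ is a fixed unit vector, $H \in \{-1,1\}^{d \times d}$ is the Hadamard matrix, and $D$ is a diagonal matrix of independent Rademacher signs.

```latex
\begin{align*}
y_i &= \frac{1}{\sqrt{d}} \sum_{j=1}^{d} W_j,
  \qquad W_j = H_{i,j} D_{j,j} x_j, \\
\mathbb{E}[W_j] &= 0,
  \qquad W_j \in \bigl[-|x_j|,\ |x_j|\bigr]
  \quad \text{(independent across } j\text{)}, \\
\Pr\Bigl(\Bigl|\sum_{j=1}^{d} W_j\Bigr| \ge s\Bigr)
  &\le 2\exp\!\Bigl(-\frac{2 s^2}{\sum_{j=1}^{d} (2|x_j|)^2}\Bigr)
   = 2\exp\!\Bigl(-\frac{s^2}{2\|x\|_2^2}\Bigr)
   = 2\exp\!\Bigl(-\frac{s^2}{2}\Bigr), \\
\Pr\bigl(|y_i| \ge t\bigr)
  &\le 2\exp\!\Bigl(-\frac{d\,t^2}{2}\Bigr)
  \qquad \text{(setting } s = \sqrt{d}\,t\text{)}.
\end{align*}
```

This bounds the tail of each coordinate for a fixed input, with variance proxy $1/d$. It does not by itself establish that the coordinates are independent or that their marginal distribution matches the one a given Lloyd-Max codebook was designed for, which is the gap the referee's first major comment points at.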
Reference graph
Works this paper leans on
- [1] H. Zhang, X. Ji, Y. Chen, F. Fu, X. Miao, X. Nie, W. Chen, and B. Cui, “Pqcache: Product quantization-based kvcache for long context llm inference,” in Proc. ACM on Management of Data, vol. 3, no. 3. ACM New York, NY, USA, 2025, pp. 1–30.
- [2] A. Zandieh, M. Daliri, M. Hadian, and V. Mirrokni, “Turboquant: 1-bit similarity estimation for llm inference,” arXiv preprint arXiv:2504.19874, 2025.
- [3] M. Xiang, B. Wang, and T. Luo, “Orpquant: Geometric orthogonal residual projection for multiplier-free power-of-two transformer quantization,” arXiv preprint arXiv:2605.26092, 2026.
- [4] H. You, Y. Guo, Y. Fu, W. Zhou, H. Shi, X. Zhang, S. Kundu, A. Yazdanbakhsh, and Y. C. Lin, “Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,” Advances in Neural Information Processing Systems, vol. 37, pp. 24 822–24 848, 2024.
- [5] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort, “Spinquant: Llm quantization with learned rotations,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 92 009–92 032.
- [6] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, “Quarot: Outlier-free 4-bit inference in rotated llms,” Advances in Neural Information Processing Systems, vol. 37, pp. 100 213–100 240, 2024.
- [7] C. Vicentino, “Polarquant: Optimal gaussian weight quantization via hadamard rotation for llm compression,” arXiv preprint arXiv:2603.29078, 2026.
- [8] U. Saxena and K. Roy, “Kvlinc: Kv cache quantization with hadamard rotation and linear correction,” arXiv preprint arXiv:2510.05373, 2025.
- [9] Y. Feng, P. Indyk, M. Kapralov, D. Krachun, and B. Prokhorov, “Provable quantization with randomized hadamard transform,” arXiv preprint arXiv:2605.13810, 2026.
- [10] E. J. Yoon, “Itq3 s: High-fidelity 3-bit llm inference via interleaved ternary quantization with rotation-domain smoothing,” arXiv preprint arXiv:2603.27914, 2026.
- [11] J. Gao and C. Long, “Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–27, 2024.
- [12] J. Gao, Y. Gou, Y. Xu, J. Shi, Y. Yang, S. Li, R. C.-W. Wong, and C. Long, “Revisiting rabitq and turboquant: A symmetric comparison of methods, theory, and experiments,” arXiv preprint arXiv:2604.19528, 2026.
- [13] N. Ailon and B. Chazelle, “The fast johnson–lindenstrauss transform and approximate nearest neighbors,” SIAM J. Comput., vol. 39, no. 1, pp. 302–322, 2009.
- [14] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Am. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
- [15] Qdrant, “dbpedia-entities-openai3-text-embedding-3-large-1536-1m dataset,” Hugging Face Datasets Repository, 2024.
Appendix A: Sub-Gaussian Concentration of the FJLT Projection
Let $x \in \mathbb{R}^d$ be a unit vector such that $\|x\|_2 = 1$. The $i$-th coordinate of the projected vector $y = \frac{1}{\sqrt{d}} H D x$ is given by $y_i = \frac{1}{\sqrt{d}} \sum_{j=1}^{d} W_j$, where $W_j = H_{i,j} D_{j,j} x_j$. Because $H \in \{-1,1\}^{d \times d}$ and $D$ is a ...