Fast-TurboQuant: A Multiplier-Free Online Vector Quantization Approach

Felipe A. P. de Figueiredo; Pedro M. R. Pereira; Rausley A. A. de Souza

arxiv: 2606.21448 · v1 · pith:L5ZK3KHQnew · submitted 2026-06-19 · 💻 cs.LG · cs.IT· eess.SP· math.IT

Fast-TurboQuant: A Multiplier-Free Online Vector Quantization Approach

Pedro M. R. Pereira , Felipe A. P. de Figueiredo , Rausley A. A. de Souza This is my paper

Pith reviewed 2026-06-26 14:37 UTC · model grok-4.3

classification 💻 cs.LG cs.ITeess.SPmath.IT

keywords vector quantizationmultiplier-freefast Walsh-Hadamard transformRademacher phase inversionJohnson-Lindenstrauss transform1-bit quantizationsub-Gaussian concentrationkey-value cache

0 comments

The pith

Fast-TurboQuant replaces dense random rotations with a Rademacher phase inversion and fast Walsh-Hadamard transform to enable multiplier-free vector quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that TurboQuant's dense matrix projection can be replaced by a structured transform of random sign flips followed by the fast Walsh-Hadamard transform. This change preserves the sub-Gaussian concentration properties required for scalar Lloyd-Max quantization of vectors. As a result the projection step uses only additions instead of multiplications. On DBpedia OpenAI-3 Large embeddings the method produces a 19.7 times algorithmic speedup in sequential execution along with lower mean squared error and higher Recall@10 from the zero-padding dimension expansion. The substitution directly targets the memory-bandwidth bottleneck in large language model key-value caches and retrieval systems on constrained hardware.

Core claim

Fast-TurboQuant substitutes the dense random rotation matrix of TurboQuant with a Rademacher phase inversion followed by a fast Walsh-Hadamard transform. The structured transform satisfies the sub-Gaussian concentration requirements for Lloyd-Max quantization, reducing arithmetic operations to additions alone. On DBpedia OpenAI-3 Large embeddings, this yields a 19.7 times algorithmic speedup in sequential execution, lower mean squared error, and improved Recall@10 from the dimension expansion.

What carries the argument

Rademacher phase inversion followed by fast Walsh-Hadamard transform (FWHT) as a multiplier-free structured Johnson-Lindenstrauss transform that conditions vector distributions for scalar quantization.

Load-bearing premise

The sub-Gaussian concentration properties of the Rademacher-plus-FWHT transform are sufficient to replace the dense random rotation matrix while preserving downstream quantization error and recall performance.

What would settle it

If the mean squared quantization error or Recall@10 on a new set of embeddings is higher with Fast-TurboQuant than with the original dense-matrix TurboQuant, the performance-preservation claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.21448 by Felipe A. P. de Figueiredo, Pedro M. R. Pereira, Rausley A. A. de Souza.

read the original abstract

As large language models scale, memory bandwidth for key-value caches and retrieval-augmented generation systems becomes a critical bottleneck. While 1-bit quantization addresses this constraint, recent TurboQuant relies on dense random rotation matrices to condition the vector distribution before quantization. This projection demands millions of floating-point multiplications per embedding, making it difficult to deploy on constrained edge silicon. We introduce Fast-TurboQuant, a multiplier-free projection architecture that replaces the dense matrix with a structured fast Johnson-Lindenstrauss transform. By applying a Rademacher phase inversion followed by a fast Walsh-Hadamard transform (FWHT), the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections. This substitution reduces the arithmetic complexity to only additions, eliminating hardware multipliers. Evaluation on DBpedia OpenAI-3 Large embeddings demonstrates a 19.7 times algorithmic speedup under sequential execution. Furthermore, the dimension expansion due to the FWHT zero-padding reduces the mean squared error and improves Recall@10.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fast-TurboQuant replaces the dense rotation in TurboQuant with Rademacher signs plus FWHT to eliminate multiplies, but the sub-Gaussian justification for keeping Lloyd-Max error the same rests on an untested assumption.

read the letter

The paper's core move is swapping the dense random rotation matrix in TurboQuant for a Rademacher phase flip followed by a fast Walsh-Hadamard transform. This cuts the projection to additions only and delivers the claimed 19.7x sequential speedup on DBpedia OpenAI-3 embeddings. The dimension expansion from zero-padding also shows up as lower MSE and higher Recall@10.

That substitution is the actual new piece. Prior work on structured Johnson-Lindenstrauss transforms exists, but applying this exact combination to the online 1-bit quantization step in TurboQuant does not appear in the cited literature.

The soft spot is the missing verification that the structured transform preserves the coordinate-wise sub-Gaussian tails Lloyd-Max needs. The abstract invokes concentration arguments, yet the description gives no tail plots, kurtosis values, or direct before-and-after quantization error on the same vectors. Without those, the recall gains could come mainly from the extra dimensions rather than the multiplier-free projection itself.

The stress-test concern lands: FWHT after random signs produces linear combinations with ±1 coefficients, which can keep marginal sub-Gaussianity but does not automatically deliver the near-independence or isotropy a dense orthogonal matrix supplies. If the paper contains no empirical check on this point, the central claim stays under-supported.

This is for engineers who need to run vector quantization on edge silicon where multipliers are expensive. A practitioner could extract the architecture and test it locally. A theorist or someone needing tight error bounds would find the current evidence thin.

Send it to review. The deployment angle is real and the implementation is straightforward, but the authors should supply the distribution checks before the speedup numbers can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The paper introduces Fast-TurboQuant, which replaces the dense random rotation matrix of TurboQuant with a multiplier-free structured projection consisting of Rademacher sign flips followed by a fast Walsh-Hadamard transform (FWHT) with zero-padding. It claims that this transform satisfies the sub-Gaussian concentration prerequisites for scalar Lloyd-Max quantization, yielding only additions, a 19.7× algorithmic speedup on sequential execution, lower MSE, and higher Recall@10 on DBpedia OpenAI-3 Large embeddings due to the dimension expansion.

Significance. If the sub-Gaussian tail properties and isotropy of the structured transform are shown to match those of the dense random orthogonal matrix sufficiently for Lloyd-Max error bounds, the multiplier-free design would be a practical advance for deploying vector quantization on edge hardware without floating-point multipliers. The explicit credit for reducing arithmetic to additions and the reported recall improvement via zero-padding are strengths, but the absence of any tail-bound verification or ablation against the dense baseline limits the current impact.

major comments (2)

[Abstract] Abstract: The central claim that 'the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections' is load-bearing for all performance attributions, yet the manuscript provides neither a derivation of the coordinate-wise moment conditions after FWHT nor any empirical comparison (e.g., kurtosis, tail quantiles, or per-coordinate quantization MSE) against the dense random rotation of the original TurboQuant.
[Evaluation] Evaluation section (implied by the DBpedia results): The reported 19.7× speedup and Recall@10 gains are presented without error bars, ablation tables isolating the effect of the structured transform versus dimension expansion, or direct head-to-head quantization-error measurements on identical embeddings before and after the Rademacher+FWHT step; this prevents confirmation that the downstream metrics are attributable to the multiplier-free projection rather than the padding.

minor comments (1)

[Abstract] Notation for the FWHT zero-padding factor and the exact dimension after expansion is not defined in the abstract or early sections, making it difficult to reproduce the claimed MSE reduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the feedback highlighting areas where additional theoretical and empirical support would strengthen the manuscript. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'the method leverages sub-Gaussian concentration to satisfy the prerequisites of scalar Lloyd-Max quantization without Gaussian projections' is load-bearing for all performance attributions, yet the manuscript provides neither a derivation of the coordinate-wise moment conditions after FWHT nor any empirical comparison (e.g., kurtosis, tail quantiles, or per-coordinate quantization MSE) against the dense random rotation of the original TurboQuant.

Authors: We concur that an explicit derivation of the coordinate-wise moment conditions and tail bounds for the Rademacher sign flips followed by FWHT would provide stronger justification for the sub-Gaussian concentration claim. While the structured fast Johnson-Lindenstrauss transform is established in the literature to preserve sub-Gaussian properties, we will add a dedicated subsection deriving these properties specifically for our setting. We will also include empirical plots comparing kurtosis, tail quantiles, and per-coordinate quantization MSE between the structured transform and the dense random rotation on the DBpedia embeddings. These additions will be incorporated in the revised manuscript. revision: yes
Referee: [Evaluation] Evaluation section (implied by the DBpedia results): The reported 19.7× speedup and Recall@10 gains are presented without error bars, ablation tables isolating the effect of the structured transform versus dimension expansion, or direct head-to-head quantization-error measurements on identical embeddings before and after the Rademacher+FWHT step; this prevents confirmation that the downstream metrics are attributable to the multiplier-free projection rather than the padding.

Authors: We agree that the current presentation lacks error bars and ablations, which limits the ability to attribute improvements precisely. The speedup figure is from sequential execution timing, and the recall improvement is linked to dimension expansion via zero-padding as noted. To address this, we will add error bars based on multiple runs for the timing and Recall@10 metrics. We will also include an ablation study with tables showing quantization MSE and recall for: (1) original TurboQuant dense projection, (2) our structured projection without padding, (3) with padding. This will isolate the contributions of the multiplier-free transform versus the dimension increase. These revisions will be made to the Evaluation section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external transform properties and empirical validation

full rationale

The paper proposes replacing dense random rotations with a Rademacher sign-flip plus FWHT structured transform, invoking sub-Gaussian concentration to justify compatibility with Lloyd-Max quantization, then reports empirical speedups and recall gains on DBpedia embeddings. No equations define a fitted parameter that is subsequently renamed as a prediction, no self-citation chain bears the central claim, and no result reduces by construction to quantities defined inside the paper. The speedup (additions only) and MSE improvement (from zero-padding) are measured outcomes on held-out data, not tautological restatements of the method's definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the known concentration properties of the fast Walsh-Hadamard transform combined with random sign flips; no new free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (1)

domain assumption Rademacher random variables combined with the fast Walsh-Hadamard transform produce sub-Gaussian vectors whose concentration is adequate for Lloyd-Max scalar quantization.
Invoked to justify replacement of dense Gaussian projections.

pith-pipeline@v0.9.1-grok · 5727 in / 1185 out tokens · 11956 ms · 2026-06-26T14:37:34.568272+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 5 linked inside Pith

[1]

Pqcache: Product quantization-based kvcache for long context llm inference,

H. Zhang, X. Ji, Y . Chen, F. Fu, X. Miao, X. Nie, W. Chen, and B. Cui, “Pqcache: Product quantization-based kvcache for long context llm inference,” inProc. ACM on Management of Data, vol. 3, no. 3. ACM New York, NY , USA, 2025, pp. 1–30

2025
[2]

Turbo- quant: 1-bit similarity estimation for llm inference,

A. Zandieh, M. Daliri, M. Hadian, and V . Mirrokni, “Turbo- quant: 1-bit similarity estimation for llm inference,”arXiv preprint arXiv:2504.19874, 2025

Pith/arXiv arXiv 2025
[3]

Orpquant: Geometric orthogonal residual projection for multiplier-free power-of-two transformer quan- tization,

M. Xiang, B. Wang, and T. Luo, “Orpquant: Geometric orthogonal residual projection for multiplier-free power-of-two transformer quan- tization,”arXiv preprint arXiv:2605.26092, 2026

Pith/arXiv arXiv 2026
[4]

Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,

H. You, Y . Guo, Y . Fu, W. Zhou, H. Shi, X. Zhang, S. Kundu, A. Yazdanbakhsh, and Y . C. Lin, “Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,”Advances in Neural Information Processing Systems, vol. 37, pp. 24 822–24 848, 2024

2024
[5]

Spinquant: Llm quantization with learned rotations,

Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V . Chandra, Y . Tian, and T. Blankevoort, “Spinquant: Llm quantization with learned rotations,” inInternational Conference on Learning Rep- resentations, vol. 2025, 2025, pp. 92 009–92 032

2025
[6]

Quarot: Outlier-free 4-bit inference in rotated llms,

S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, “Quarot: Outlier-free 4-bit inference in rotated llms,”Advances in Neural Information Processing Systems, vol. 37, pp. 100 213–100 240, 2024

2024
[7]

Polarquant: Optimal gaussian weight quantiza- tion via hadamard rotation for llm compression,

C. Vicentino, “Polarquant: Optimal gaussian weight quantiza- tion via hadamard rotation for llm compression,”arXiv preprint arXiv:2603.29078, 2026

Pith/arXiv arXiv 2026
[8]

Kvlinc: Kv cache quantization with hadamard rotation and linear correction,

U. Saxena and K. Roy, “Kvlinc: Kv cache quantization with hadamard rotation and linear correction,”arXiv preprint arXiv:2510.05373, 2025

arXiv 2025
[9]

Prov- able quantization with randomized hadamard transform,

Y . Feng, P. Indyk, M. Kapralov, D. Krachun, and B. Prokhorov, “Prov- able quantization with randomized hadamard transform,”arXiv preprint arXiv:2605.13810, 2026

Pith/arXiv arXiv 2026
[10]

Itq3 s: High-fidelity 3-bit llm inference via interleaved ternary quantization with rotation-domain smoothing,

E. J. Yoon, “Itq3 s: High-fidelity 3-bit llm inference via interleaved ternary quantization with rotation-domain smoothing,”arXiv preprint arXiv:2603.27914, 2026

arXiv 2026
[11]

Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,

J. Gao and C. Long, “Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–27, 2024

2024
[12]

Revisiting rabitq and turboquant: A symmetric comparison of methods, theory, and experiments,

J. Gao, Y . Gou, Y . Xu, J. Shi, Y . Yang, S. Li, R. C.-W. Wong, and C. Long, “Revisiting rabitq and turboquant: A symmetric comparison of methods, theory, and experiments,”arXiv preprint arXiv:2604.19528, 2026

Pith/arXiv arXiv 2026
[13]

The fast johnson–lindenstrauss transform and approximate nearest neighbors,

N. Ailon and B. Chazelle, “The fast johnson–lindenstrauss transform and approximate nearest neighbors,”SIAM J. Comput., vol. 39, no. 1, pp. 302–322, 2009

2009
[14]

Probability inequalities for sums of bounded random variables,

W. Hoeffding, “Probability inequalities for sums of bounded random variables,”J. Am. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963

1963
[15]

dbpedia-entities-openai3-text-embedding-3-large-1536-1m dataset,

Qdrant, “dbpedia-entities-openai3-text-embedding-3-large-1536-1m dataset,” Hugging Face Datasets Repository, 2024. APPENDIXA: SUB-GAUSSIANCONCENTRATION OF THE FJLT PROJECTION Letx∈R d be a unit vector such as∥x∥ 2 = 1. The i-th coordinate of the projected vectory= 1√ dHDxis given byy i = 1√ d Pd j=1 Wj,whereW j =H i,jDj,jxj. BecauseH∈ {−1,1} d×d andDis a ...

2024

[1] [1]

Pqcache: Product quantization-based kvcache for long context llm inference,

H. Zhang, X. Ji, Y . Chen, F. Fu, X. Miao, X. Nie, W. Chen, and B. Cui, “Pqcache: Product quantization-based kvcache for long context llm inference,” inProc. ACM on Management of Data, vol. 3, no. 3. ACM New York, NY , USA, 2025, pp. 1–30

2025

[2] [2]

Turbo- quant: 1-bit similarity estimation for llm inference,

A. Zandieh, M. Daliri, M. Hadian, and V . Mirrokni, “Turbo- quant: 1-bit similarity estimation for llm inference,”arXiv preprint arXiv:2504.19874, 2025

Pith/arXiv arXiv 2025

[3] [3]

Orpquant: Geometric orthogonal residual projection for multiplier-free power-of-two transformer quan- tization,

M. Xiang, B. Wang, and T. Luo, “Orpquant: Geometric orthogonal residual projection for multiplier-free power-of-two transformer quan- tization,”arXiv preprint arXiv:2605.26092, 2026

Pith/arXiv arXiv 2026

[4] [4]

Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,

H. You, Y . Guo, Y . Fu, W. Zhou, H. Shi, X. Zhang, S. Kundu, A. Yazdanbakhsh, and Y . C. Lin, “Shiftaddllm: Accelerating pretrained llms via post-training multiplication-less reparameterization,”Advances in Neural Information Processing Systems, vol. 37, pp. 24 822–24 848, 2024

2024

[5] [5]

Spinquant: Llm quantization with learned rotations,

Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V . Chandra, Y . Tian, and T. Blankevoort, “Spinquant: Llm quantization with learned rotations,” inInternational Conference on Learning Rep- resentations, vol. 2025, 2025, pp. 92 009–92 032

2025

[6] [6]

Quarot: Outlier-free 4-bit inference in rotated llms,

S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman, “Quarot: Outlier-free 4-bit inference in rotated llms,”Advances in Neural Information Processing Systems, vol. 37, pp. 100 213–100 240, 2024

2024

[7] [7]

Polarquant: Optimal gaussian weight quantiza- tion via hadamard rotation for llm compression,

C. Vicentino, “Polarquant: Optimal gaussian weight quantiza- tion via hadamard rotation for llm compression,”arXiv preprint arXiv:2603.29078, 2026

Pith/arXiv arXiv 2026

[8] [8]

Kvlinc: Kv cache quantization with hadamard rotation and linear correction,

U. Saxena and K. Roy, “Kvlinc: Kv cache quantization with hadamard rotation and linear correction,”arXiv preprint arXiv:2510.05373, 2025

arXiv 2025

[9] [9]

Prov- able quantization with randomized hadamard transform,

Y . Feng, P. Indyk, M. Kapralov, D. Krachun, and B. Prokhorov, “Prov- able quantization with randomized hadamard transform,”arXiv preprint arXiv:2605.13810, 2026

Pith/arXiv arXiv 2026

[10] [10]

Itq3 s: High-fidelity 3-bit llm inference via interleaved ternary quantization with rotation-domain smoothing,

E. J. Yoon, “Itq3 s: High-fidelity 3-bit llm inference via interleaved ternary quantization with rotation-domain smoothing,”arXiv preprint arXiv:2603.27914, 2026

arXiv 2026

[11] [11]

Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,

J. Gao and C. Long, “Rabitq: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–27, 2024

2024

[12] [12]

Revisiting rabitq and turboquant: A symmetric comparison of methods, theory, and experiments,

J. Gao, Y . Gou, Y . Xu, J. Shi, Y . Yang, S. Li, R. C.-W. Wong, and C. Long, “Revisiting rabitq and turboquant: A symmetric comparison of methods, theory, and experiments,”arXiv preprint arXiv:2604.19528, 2026

Pith/arXiv arXiv 2026

[13] [13]

The fast johnson–lindenstrauss transform and approximate nearest neighbors,

N. Ailon and B. Chazelle, “The fast johnson–lindenstrauss transform and approximate nearest neighbors,”SIAM J. Comput., vol. 39, no. 1, pp. 302–322, 2009

2009

[14] [14]

Probability inequalities for sums of bounded random variables,

W. Hoeffding, “Probability inequalities for sums of bounded random variables,”J. Am. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963

1963

[15] [15]

dbpedia-entities-openai3-text-embedding-3-large-1536-1m dataset,

Qdrant, “dbpedia-entities-openai3-text-embedding-3-large-1536-1m dataset,” Hugging Face Datasets Repository, 2024. APPENDIXA: SUB-GAUSSIANCONCENTRATION OF THE FJLT PROJECTION Letx∈R d be a unit vector such as∥x∥ 2 = 1. The i-th coordinate of the projected vectory= 1√ dHDxis given byy i = 1√ d Pd j=1 Wj,whereW j =H i,jDj,jxj. BecauseH∈ {−1,1} d×d andDis a ...

2024