pith. sign in

arxiv: 2504.19874 · v1 · pith:S4XIHWQSnew · submitted 2025-04-28 · 💻 cs.LG · cs.AI· cs.DB· cs.DS

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Pith reviewed 2026-05-20 08:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.DBcs.DS
keywords vector quantizationdistortion rateonline quantizationrandom rotationscalar quantizationKV cache quantizationnearest neighbor searchinformation theoretic bounds
0
0 comments X

The pith

TurboQuant achieves vector quantization with distortion rates within a small constant factor of the information-theoretic minimum.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TurboQuant, a data-oblivious online algorithm for quantizing high-dimensional Euclidean vectors to minimize either mean-squared error or inner-product distortion. It works by applying a random rotation that concentrates coordinate values into a Beta distribution, then quantizing each coordinate independently with an optimal scalar quantizer. Because high-dimensional vectors have nearly independent coordinates after rotation, this simple per-coordinate approach stays close to the best possible rate. The authors prove matching lower bounds showing that no vector quantizer can do substantially better, and TurboQuant reaches within a factor of roughly 2.7 of those bounds. The method is tested on KV-cache compression and nearest-neighbor search, where it delivers near-neutral quality at 3.5 bits per channel.

Core claim

TurboQuant randomly rotates input vectors to induce a concentrated Beta distribution on coordinates and then applies optimal scalar quantizers independently to each coordinate. This yields near-optimal mean-squared-error distortion. For unbiased inner-product estimation, an MSE quantizer is followed by a 1-bit Quantized JL transform on the residual. The resulting scheme matches the proven information-theoretic lower bound on distortion rate up to a small constant factor of approximately 2.7 across all bit widths and dimensions.

What carries the argument

Random rotation that concentrates coordinates into a Beta distribution, enabling independent optimal scalar quantization per coordinate while preserving near-optimality due to high-dimensional near-independence.

If this is right

  • KV-cache quantization reaches absolute quality neutrality at 3.5 bits per channel.
  • Only marginal quality degradation appears at 2.5 bits per channel for KV caches.
  • Nearest-neighbor search recall improves over product quantization while indexing time drops to essentially zero.
  • The same scheme works online and data-obliviously for every bit width and dimension.
  • Any future vector quantizer must respect the same information-theoretic lower bounds established in the paper.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rotation-plus-scalar-quantizer pattern could be combined with learned per-coordinate codebooks to tighten the constant factor in domain-specific settings.
  • Because no data-dependent preprocessing is required, the method fits naturally into streaming or continually changing high-dimensional data pipelines.
  • The same independence argument may extend to other distortion measures such as cosine similarity or Wasserstein distance once suitable scalar quantizers are identified.

Load-bearing premise

After random rotation the coordinates become sufficiently near-independent in high dimensions that separate scalar quantization loses only a small constant factor in distortion compared with any joint scheme.

What would settle it

A calculation or experiment in which TurboQuant's distortion rate exceeds three times the proven lower bound for some bit width and dimension would show the near-optimality claim is false.

read the original abstract

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TurboQuant, a data-oblivious online vector quantization algorithm for high-dimensional Euclidean vectors. It applies a random rotation to induce concentrated Beta marginals on coordinates, then performs independent scalar quantization per coordinate. For inner-product preservation it adds a two-stage correction using an MSE quantizer followed by a 1-bit Quantized JL transform. The authors supply a formal proof of information-theoretic lower bounds on achievable distortion and claim that TurboQuant attains rates within a small constant factor (approximately 2.7) of these bounds across bit-widths and dimensions. Experiments report absolute quality neutrality at 3.5 bits per channel for KV-cache quantization and improved recall with near-zero indexing time for nearest-neighbor search.

Significance. If the central claims are substantiated, the work offers a theoretically grounded, online quantization primitive that is attractive for memory-constrained inference in large models. The combination of a data-oblivious construction, explicit lower-bound proof, and concrete task-level validation constitutes a useful contribution to the quantization literature.

major comments (2)
  1. [Core algorithm and Beta distribution analysis] Core algorithm description and Beta-concentration analysis: the central claim that random rotation plus per-coordinate scalar quantization yields distortion within a small constant factor of the vector-quantization lower bound rests on the coordinates behaving as sufficiently near-independent after rotation. The unit-norm constraint induces negative dependence (sum x_i^2 = ||v||^2) that is not obviously canceled by the marginal concentration argument; this dependence could enlarge the mutual-information gap or error accumulation beyond the stated ≈2.7 factor, especially at moderate dimensions or low bit-widths. A quantitative bound that accounts for the residual joint dependence is needed to support the optimality claim.
  2. [Lower-bound proof section] Formal proof of information-theoretic lower bounds: the proof is presented as independent of the algorithm construction, yet the manuscript must explicitly show how the reported empirical distortion rates are obtained relative to the derived lower bound without post-hoc fitting from the same evaluation data. Without this explicit linkage, the constant-factor comparison cannot be verified as non-circular.
minor comments (2)
  1. [Experimental results] Experimental sections on KV-cache and nearest-neighbor search should report standard deviations or results over multiple random seeds and datasets to allow assessment of variability in the quality metrics.
  2. [Notation and definitions] Clarify the precise normalization used for the distortion-rate quantity when comparing across bit-widths and dimensions in both theory and experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and the recommendation for major revision. We believe the concerns can be addressed by clarifying the analysis and adding explicit comparisons, as detailed in our point-by-point responses below.

read point-by-point responses
  1. Referee: Core algorithm and Beta distribution analysis: the central claim that random rotation plus per-coordinate scalar quantization yields distortion within a small constant factor of the vector-quantization lower bound rests on the coordinates behaving as sufficiently near-independent after rotation. The unit-norm constraint induces negative dependence (sum x_i^2 = ||v||^2) that is not obviously canceled by the marginal concentration argument; this dependence could enlarge the mutual-information gap or error accumulation beyond the stated ≈2.7 factor, especially at moderate dimensions or low bit-widths. A quantitative bound that accounts for the residual joint dependence is needed to support the optimality claim.

    Authors: We agree that the unit-norm constraint introduces negative dependence among coordinates. However, for high-dimensional vectors uniformly distributed on the sphere, after a random orthogonal transformation, the coordinates are known to be approximately independent with the dependence becoming negligible as the dimension increases. Our proof leverages concentration properties of the Beta distribution for the marginals and bounds the quantization error using these marginals, with the constant factor of approximately 2.7 derived from the ratio of the vector rate-distortion function to the scalar one under the induced distribution. To strengthen this, we will add a quantitative analysis of the residual dependence in the revised manuscript, showing that it does not increase the factor beyond 2.7 for dimensions d >= 64 and bit-widths >= 2, using bounds from negative association or Stein's method. revision: partial

  2. Referee: Formal proof of information-theoretic lower bounds: the proof is presented as independent of the algorithm construction, yet the manuscript must explicitly show how the reported empirical distortion rates are obtained relative to the derived lower bound without post-hoc fitting from the same evaluation data. Without this explicit linkage, the constant-factor comparison cannot be verified as non-circular.

    Authors: The information-theoretic lower bound is derived independently in the proof section using standard techniques from rate-distortion theory applied to the distortion measures (MSE and inner-product), without reference to TurboQuant. The empirical rates are obtained by running the algorithm on test vectors and computing the average distortion directly. The factor ≈2.7 is the maximum observed ratio across our experiments and is consistent with the theoretical gap between scalar and vector quantization under Beta marginals. We will revise the manuscript to include an explicit comparison table that lists the theoretical lower bound, the achieved distortion for TurboQuant, and the ratio for each setting, making the non-circular nature clear. No post-hoc fitting was performed; the constant is an upper bound on the observed ratios. revision: yes

Circularity Check

0 steps flagged

No significant circularity; lower-bound proof and algorithm analysis are independent

full rationale

The paper derives its main result by first proving an information-theoretic lower bound on distortion for any vector quantizer, then separately analyzing TurboQuant's distortion under random rotation plus per-coordinate scalar quantization using Beta marginals. The lower-bound proof is presented as a standalone information-theoretic argument that does not invoke the specific TurboQuant construction or its fitted constants. Distortion rates for the method are obtained from explicit concentration bounds and scalar quantization error formulas rather than by fitting parameters to the same evaluation data used to report empirical performance. The near-independence assumption after rotation is an explicit modeling choice applied uniformly to both the algorithm and the gap analysis, but it does not create a definitional loop or reduce the claimed constant-factor gap to a tautology. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The construction rests on standard high-dimensional concentration properties of random rotations and the near-independence of coordinates after rotation; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Random orthogonal transformations concentrate coordinate marginals to a Beta distribution and render distinct coordinates nearly independent in high dimensions.
    Invoked to justify applying independent scalar quantizers after rotation.

pith-pipeline@v0.9.0 · 5820 in / 1427 out tokens · 40402 ms · 2026-05-20T08:03:49.206400+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant (≈2.7) factor

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  2. Block-Sphere Vector Quantization

    cs.LG 2026-05 unverdicted novelty 7.0

    BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.

  3. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...

  4. PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources

    cs.IT 2026-05 conditional novelty 7.0

    PrismQuant achieves near rate-distortion optimality for Gaussian-mixture sources by losslessly transmitting the mixture component label at H(C)/n bits per dimension and applying component-matched KLT plus scalar quant...

  5. Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

    cs.LG 2026-05 unverdicted novelty 7.0

    Two randomized Hadamard transforms suffice to make coordinate marginals O(d^{-1/2})-close to Gaussian for most quantization methods, with three needed for vector quantization to match uniform random rotations asymptotically.

  6. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  7. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  8. 3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

    cs.CV 2026-04 conditional novelty 7.0

    3DTurboQuant achieves training-free near-optimal quantization for 3DGS and DUSt3R models via random rotations inducing Beta distributions, enabling precomputed Lloyd-Max quantizers that deliver 3.5x and 7.9x compressi...

  9. SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

    cs.AI 2026-04 unverdicted novelty 7.0

    SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

  10. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...

  11. OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

    cs.LG 2026-05 unverdicted novelty 6.0

    OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

  12. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.

  13. VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

    cs.AR 2026-05 unverdicted novelty 6.0

    VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

  14. Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours

    cs.AR 2026-05 unverdicted novelty 6.0

    Design Conductor 2.0 uses April 2026 frontier models to autonomously create a 5129-unit FP16/32 TurboQuant inference accelerator mapped to FPGA at 125 MHz in 80 hours.

  15. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.

  16. Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

    cs.LG 2026-04 unverdicted novelty 6.0

    At 4-bit budget KQV wins on KL divergence, geometric K error and 6D distance with unconditional K-V asymmetry; QKQV wins geometrically at other budgets because the Jensen-amplified variance inflation from QJL on K doe...

  17. HARBOR: Automated Harness Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

  18. SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.

  19. Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

    cs.LG 2026-04 unverdicted novelty 6.0

    Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

  20. eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

    cs.LG 2026-04 unverdicted novelty 6.0

    eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.

  21. High-Rate Quantized Matrix Multiplication I

    cs.IT 2026-01 unverdicted novelty 5.0

    High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.

  22. Hierarchical vs. Flat Iteration in Shared-Weight Transformers

    cs.CL 2026-04 unverdicted novelty 4.0

    Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.

  23. ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

    eess.SP 2026-04 unverdicted novelty 3.0

    ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 22 Pith papers · 15 internal anchors

  1. [1]

    https://www.elastic.co/enterprise-search/vector-search

    Elastic search., 2025. https://www.elastic.co/enterprise-search/vector-search

  2. [2]

    https://qdrant.tech/

    Qdrant vectore search., 2025. https://qdrant.tech/

  3. [3]

    https://github.com/pgvector/pgvector/

    Pgvector search., 2025. https://github.com/pgvector/pgvector/

  4. [4]

    https://www.pinecone.io/

    Pinecone vectore database., 2025. https://www.pinecone.io/

  5. [5]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  6. [6]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Pro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp. 4895–4901, 2023

  7. [7]

    Claude, 2024

    Anthropic. Claude, 2024. https://www.anthropic.com/news/claude-3-family

  8. [8]

    L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J

    Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024

  9. [9]

    and Lempitsky, V

    Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 931– 938, 2014

  10. [10]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 , 2023

  11. [11]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 21

  12. [12]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069 , 2024

  13. [13]

    Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems , 36:4396–4429, 2023

  14. [14]

    Cover, T. M. Elements of information theory . John Wiley & Sons, 1999

  15. [15]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 , 2024

  16. [16]

    Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix mul- tiplication for transformers at scale. Advances in Neural Information Processing Systems , 35: 30318–30332, 2022

  17. [17]

    Qaq: Quality adaptive quantization for llm kv cache

    Dong, S., Cheng, W., Qin, J., and Wang, W. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643 , 2024

  18. [18]

    The Llama 3 Herd of Models

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024

  19. [19]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

  20. [20]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 , 2022

  21. [21]

    Data engineering for scaling language models to 128k context

    Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171 , 2024. URL https://github.com/FranxYao/Long-Context-Data-Engineering

  22. [22]

    Gao, J., Gou, Y., Xu, Y., Yang, Y., Long, C., and Wong, R. C.-W. Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search. arXiv preprint arXiv:2409.09913 , 2024

  23. [23]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023

  24. [24]

    Optimized product quantization for approximate near- est neighbor search

    Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate near- est neighbor search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2946–2953, 2013

  25. [25]

    Asymptotically optimal block quantization

    Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on information theory, 25(4):373–380, 1979. 22

  26. [26]

    On the structure of vector quantizers

    Gersho, A. On the structure of vector quantizers. IEEE Transactions on Information Theory , 28(2):157–166, 1982

  27. [27]

    Accelerat- ing large-scale inference with anisotropic vector quantization

    Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerat- ing large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. PMLR, 2020

  28. [28]

    Polarquant: Quantizing kv caches with polar transformation

    Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. Polarquant: Quantizing kv caches with polar transformation. arXiv preprint arXiv:2502.02617 , 2025

  29. [29]

    Balancekv: Kv cache compression through discrepancy theory

    Han, I., Kapralov, M., Kochetkova, E., Sheth, K., and Zandieh, A. Balancekv: Kv cache compression through discrepancy theory. arXiv preprint arXiv:2502.07861 , 2025

  30. [30]

    KVQuant: Towards 10 million context length llm inference with kv cache quantization.arXiv preprint arXiv:2401.18079, 2024

    Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quanti- zation. arXiv preprint arXiv:2401.18079 , 2024

  31. [31]

    Product quantization for nearest neighbor search

    Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence , 33(1):117–128, 2010

  32. [32]

    Needle in a haystack - pressure testing llms., 2023

    Kamradt, G. Needle in a haystack - pressure testing llms., 2023. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack

  33. [33]

    GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

    Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527, 2024

  34. [34]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  35. [35]

    and Zaharia, M

    Khattab, O. and Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval , pp. 39–48, 2020

  36. [36]

    Lexico: Extreme kv cache compression via sparse coding over universal dictionaries

    Kim, J., Park, J., Cho, J., and Papailiopoulos, D. Lexico: Extreme kv cache compression via sparse coding over universal dictionaries. arXiv preprint arXiv:2412.08890 , 2024

  37. [37]

    W., and Keutzer, K

    Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629 , 2023

  38. [38]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024

  39. [39]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems , 6:87–100, 2024

  40. [40]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems , 36, 2024. 23

  41. [41]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

  42. [42]

    Least squares quantization in pcm

    Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory , 28(2): 129–137, 1982

  43. [43]

    Quantizing for minimum distortion

    Max, J. Quantizing for minimum distortion. IRE Transactions on Information Theory , 6(1): 7–12, 1960

  44. [44]

    and Dite, W

    Panter, P. and Dite, W. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, 1951

  45. [45]

    GloVe: Global vectors for word representation

    Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL ...

  46. [46]

    Colbertv2: Effective and efficient retrieval via lightweight late interaction

    Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. Colbertv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488 , 2021

  47. [47]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision,

    Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention- 3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024

  48. [48]

    Shannon, C. E. A mathematical theory of communication. The Bell system technical journal , 27(3):379–423, 1948

  49. [49]

    Shannon, C. E. et al. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec, 4(142-163):1, 1959

  50. [50]

    Fast Transformer Decoding: One Write-Head is All You Need

    Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  51. [51]

    RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations.arXiv preprint arXiv:2501.16383, 2025

    Su, Z., Chen, Z., Shen, W., Wei, H., Li, L., Yu, H., and Yuan, K. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations, 2025. URL https://arxiv.org/abs/2501.16383

  52. [52]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024

  53. [53]

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Thakur, N., Reimers, N., R¨ uckl´ e, A., Srivastava, A., and Gurevych, I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ

  54. [54]

    N., Kaiser, L., and Polosukhin, I

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. NeurIPS, 2017. 24

  55. [55]

    High-dimensional probability: An introduction with applications in data science , volume 47

    Vershynin, R. High-dimensional probability: An introduction with applications in data science , volume 47. Cambridge university press, 2018

  56. [56]

    T., et al

    Wang, J., Zhang, T., Sebe, N., Shen, H. T., et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence , 40(4):769–790, 2017

  57. [57]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023

  58. [58]

    Efficient Streaming Language Models with Attention Sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 , 2023

  59. [59]

    Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S

    Yang, J. Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S. J., and Lee, D. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096 , 2024

  60. [60]

    WKVQuant: Quantizing weight and key/value cache for large language models gains more

    Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., and Nie, L. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065 , 2024

  61. [61]

    Zador, P. L. Development and evaluation of procedures for quantizing multivariate distributions. Stanford University, 1964

  62. [63]

    Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead

    Zandieh, A., Daliri, M., and Han, I. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. arXiv preprint arXiv:2406.03482 , 2024

  63. [64]

    Subgen: Token generation in sublinear time and memory

    Zandieh, A., Han, I., Mirrokni, V., and Karbasi, A. Subgen: Token generation in sublinear time and memory. arXiv preprint arXiv:2402.06082 , 2024

  64. [65]

    Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

    Zhang, T., Yi, J., Xu, Z., and Shrivastava, A. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917 , 2024

  65. [66]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., R´ e, C., Barrett, C., et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems , 36, 2024. 25