TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh; Majid Daliri; Majid Hadian; Vahab Mirrokni

arxiv: 2504.19874 · v1 · pith:S4XIHWQSnew · submitted 2025-04-28 · 💻 cs.LG · cs.AI· cs.DB· cs.DS

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh , Majid Daliri , Majid Hadian , Vahab Mirrokni This is my paper

Pith reviewed 2026-05-20 08:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.DBcs.DS

keywords vector quantizationdistortion rateonline quantizationrandom rotationscalar quantizationKV cache quantizationnearest neighbor searchinformation theoretic bounds

0 comments

The pith

TurboQuant achieves vector quantization with distortion rates within a small constant factor of the information-theoretic minimum.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TurboQuant, a data-oblivious online algorithm for quantizing high-dimensional Euclidean vectors to minimize either mean-squared error or inner-product distortion. It works by applying a random rotation that concentrates coordinate values into a Beta distribution, then quantizing each coordinate independently with an optimal scalar quantizer. Because high-dimensional vectors have nearly independent coordinates after rotation, this simple per-coordinate approach stays close to the best possible rate. The authors prove matching lower bounds showing that no vector quantizer can do substantially better, and TurboQuant reaches within a factor of roughly 2.7 of those bounds. The method is tested on KV-cache compression and nearest-neighbor search, where it delivers near-neutral quality at 3.5 bits per channel.

Core claim

TurboQuant randomly rotates input vectors to induce a concentrated Beta distribution on coordinates and then applies optimal scalar quantizers independently to each coordinate. This yields near-optimal mean-squared-error distortion. For unbiased inner-product estimation, an MSE quantizer is followed by a 1-bit Quantized JL transform on the residual. The resulting scheme matches the proven information-theoretic lower bound on distortion rate up to a small constant factor of approximately 2.7 across all bit widths and dimensions.

What carries the argument

Random rotation that concentrates coordinates into a Beta distribution, enabling independent optimal scalar quantization per coordinate while preserving near-optimality due to high-dimensional near-independence.

If this is right

KV-cache quantization reaches absolute quality neutrality at 3.5 bits per channel.
Only marginal quality degradation appears at 2.5 bits per channel for KV caches.
Nearest-neighbor search recall improves over product quantization while indexing time drops to essentially zero.
The same scheme works online and data-obliviously for every bit width and dimension.
Any future vector quantizer must respect the same information-theoretic lower bounds established in the paper.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rotation-plus-scalar-quantizer pattern could be combined with learned per-coordinate codebooks to tighten the constant factor in domain-specific settings.
Because no data-dependent preprocessing is required, the method fits naturally into streaming or continually changing high-dimensional data pipelines.
The same independence argument may extend to other distortion measures such as cosine similarity or Wasserstein distance once suitable scalar quantizers are identified.

Load-bearing premise

After random rotation the coordinates become sufficiently near-independent in high dimensions that separate scalar quantization loses only a small constant factor in distortion compared with any joint scheme.

What would settle it

A calculation or experiment in which TurboQuant's distortion rate exceeds three times the proven lower bound for some bit width and dimension would show the near-optimality claim is false.

read the original abstract

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TurboQuant gives a practical online quantizer via random rotation plus per-coordinate scalar steps that hits usable rates for KV-cache and NNS, but the claimed 2.7x gap to the information-theoretic bound rests on an independence assumption that residual norm constraints may stretch.

read the letter

The main point is that the authors build an online vector quantizer that first applies a random rotation to push coordinates toward concentrated Beta marginals, then uses standard scalar quantizers on each coordinate. For inner-product tasks they add a second 1-bit QJL stage on the residual to keep the estimator unbiased. They also supply a separate lower-bound argument showing that no vector quantizer can do much better than a small constant times their rate. Experiments report near-neutral KV-cache quality at 3.5 bits per channel and better recall than product quantization at essentially zero indexing cost.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TurboQuant, a data-oblivious online vector quantization algorithm for high-dimensional Euclidean vectors. It applies a random rotation to induce concentrated Beta marginals on coordinates, then performs independent scalar quantization per coordinate. For inner-product preservation it adds a two-stage correction using an MSE quantizer followed by a 1-bit Quantized JL transform. The authors supply a formal proof of information-theoretic lower bounds on achievable distortion and claim that TurboQuant attains rates within a small constant factor (approximately 2.7) of these bounds across bit-widths and dimensions. Experiments report absolute quality neutrality at 3.5 bits per channel for KV-cache quantization and improved recall with near-zero indexing time for nearest-neighbor search.

Significance. If the central claims are substantiated, the work offers a theoretically grounded, online quantization primitive that is attractive for memory-constrained inference in large models. The combination of a data-oblivious construction, explicit lower-bound proof, and concrete task-level validation constitutes a useful contribution to the quantization literature.

major comments (2)

[Core algorithm and Beta distribution analysis] Core algorithm description and Beta-concentration analysis: the central claim that random rotation plus per-coordinate scalar quantization yields distortion within a small constant factor of the vector-quantization lower bound rests on the coordinates behaving as sufficiently near-independent after rotation. The unit-norm constraint induces negative dependence (sum x_i^2 = ||v||^2) that is not obviously canceled by the marginal concentration argument; this dependence could enlarge the mutual-information gap or error accumulation beyond the stated ≈2.7 factor, especially at moderate dimensions or low bit-widths. A quantitative bound that accounts for the residual joint dependence is needed to support the optimality claim.
[Lower-bound proof section] Formal proof of information-theoretic lower bounds: the proof is presented as independent of the algorithm construction, yet the manuscript must explicitly show how the reported empirical distortion rates are obtained relative to the derived lower bound without post-hoc fitting from the same evaluation data. Without this explicit linkage, the constant-factor comparison cannot be verified as non-circular.

minor comments (2)

[Experimental results] Experimental sections on KV-cache and nearest-neighbor search should report standard deviations or results over multiple random seeds and datasets to allow assessment of variability in the quality metrics.
[Notation and definitions] Clarify the precise normalization used for the distortion-rate quantity when comparing across bit-widths and dimensions in both theory and experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and the recommendation for major revision. We believe the concerns can be addressed by clarifying the analysis and adding explicit comparisons, as detailed in our point-by-point responses below.

read point-by-point responses

Referee: Core algorithm and Beta distribution analysis: the central claim that random rotation plus per-coordinate scalar quantization yields distortion within a small constant factor of the vector-quantization lower bound rests on the coordinates behaving as sufficiently near-independent after rotation. The unit-norm constraint induces negative dependence (sum x_i^2 = ||v||^2) that is not obviously canceled by the marginal concentration argument; this dependence could enlarge the mutual-information gap or error accumulation beyond the stated ≈2.7 factor, especially at moderate dimensions or low bit-widths. A quantitative bound that accounts for the residual joint dependence is needed to support the optimality claim.

Authors: We agree that the unit-norm constraint introduces negative dependence among coordinates. However, for high-dimensional vectors uniformly distributed on the sphere, after a random orthogonal transformation, the coordinates are known to be approximately independent with the dependence becoming negligible as the dimension increases. Our proof leverages concentration properties of the Beta distribution for the marginals and bounds the quantization error using these marginals, with the constant factor of approximately 2.7 derived from the ratio of the vector rate-distortion function to the scalar one under the induced distribution. To strengthen this, we will add a quantitative analysis of the residual dependence in the revised manuscript, showing that it does not increase the factor beyond 2.7 for dimensions d >= 64 and bit-widths >= 2, using bounds from negative association or Stein's method. revision: partial
Referee: Formal proof of information-theoretic lower bounds: the proof is presented as independent of the algorithm construction, yet the manuscript must explicitly show how the reported empirical distortion rates are obtained relative to the derived lower bound without post-hoc fitting from the same evaluation data. Without this explicit linkage, the constant-factor comparison cannot be verified as non-circular.

Authors: The information-theoretic lower bound is derived independently in the proof section using standard techniques from rate-distortion theory applied to the distortion measures (MSE and inner-product), without reference to TurboQuant. The empirical rates are obtained by running the algorithm on test vectors and computing the average distortion directly. The factor ≈2.7 is the maximum observed ratio across our experiments and is consistent with the theoretical gap between scalar and vector quantization under Beta marginals. We will revise the manuscript to include an explicit comparison table that lists the theoretical lower bound, the achieved distortion for TurboQuant, and the ratio for each setting, making the non-circular nature clear. No post-hoc fitting was performed; the constant is an upper bound on the observed ratios. revision: yes

Circularity Check

0 steps flagged

No significant circularity; lower-bound proof and algorithm analysis are independent

full rationale

The paper derives its main result by first proving an information-theoretic lower bound on distortion for any vector quantizer, then separately analyzing TurboQuant's distortion under random rotation plus per-coordinate scalar quantization using Beta marginals. The lower-bound proof is presented as a standalone information-theoretic argument that does not invoke the specific TurboQuant construction or its fitted constants. Distortion rates for the method are obtained from explicit concentration bounds and scalar quantization error formulas rather than by fitting parameters to the same evaluation data used to report empirical performance. The near-independence assumption after rotation is an explicit modeling choice applied uniformly to both the algorithm and the gap analysis, but it does not create a definitional loop or reduce the claimed constant-factor gap to a tautology. No self-citations, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the provided derivation outline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The construction rests on standard high-dimensional concentration properties of random rotations and the near-independence of coordinates after rotation; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption Random orthogonal transformations concentrate coordinate marginals to a Beta distribution and render distinct coordinates nearly independent in high dimensions.
Invoked to justify applying independent scalar quantizers after rotation.

pith-pipeline@v0.9.0 · 5820 in / 1427 out tokens · 40402 ms · 2026-05-20T08:03:49.206400+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant (≈2.7) factor

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 conditional novelty 8.0

HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
Block-Sphere Vector Quantization
cs.LG 2026-05 unverdicted novelty 7.0

BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...
PrismQuant: Rate-Distortion-Optimal Vector Quantization for Gaussian-Mixture Sources
cs.IT 2026-05 conditional novelty 7.0

PrismQuant achieves near rate-distortion optimality for Gaussian-mixture sources by losslessly transmitting the mixture component label at H(C)/n bits per dimension and applying component-matched KLT plus scalar quant...
Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
cs.LG 2026-05 unverdicted novelty 7.0

Two randomized Hadamard transforms suffice to make coordinate marginals O(d^{-1/2})-close to Gaussian for most quantization methods, with three needed for vector quantization to match uniform random rotations asymptotically.
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
cs.PF 2026-05 unverdicted novelty 7.0

A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
cs.LG 2026-04 unverdicted novelty 7.0

Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models
cs.CV 2026-04 conditional novelty 7.0

3DTurboQuant achieves training-free near-optimal quantization for 3DGS and DUSt3R models via random rotations inducing Beta distributions, enabling precomputed Lloyd-Max quantizers that deliver 3.5x and 7.9x compressi...
SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems
cs.AI 2026-04 unverdicted novelty 7.0

SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
cs.LG 2026-05 unverdicted novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSCAR achieves near-BF16 accuracy for 2-bit KV cache quantization by using offline spectral covariance-aware rotations aligned with attention, plus a custom deployable INT2 kernel compatible with paged serving.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
cs.AR 2026-05 unverdicted novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours
cs.AR 2026-05 unverdicted novelty 6.0

Design Conductor 2.0 uses April 2026 frontier models to autonomously create a 5129-unit FP16/32 TurboQuant inference accelerator mapped to FPGA at 125 MHz in 80 hours.
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 unverdicted novelty 6.0

HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
cs.LG 2026-04 unverdicted novelty 6.0

At 4-bit budget KQV wins on KL divergence, geometric K error and 6D distance with unconditional K-V asymmetry; QKQV wins geometrically at other budgets because the Jensen-amplified variance inflation from QJL on K doe...
HARBOR: Automated Harness Optimization
cs.LG 2026-04 unverdicted novelty 6.0

HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.
SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
cs.LG 2026-04 unverdicted novelty 6.0

Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
cs.LG 2026-04 unverdicted novelty 6.0

eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.
High-Rate Quantized Matrix Multiplication I
cs.IT 2026-01 unverdicted novelty 5.0

High-rate quantization theory yields accurate approximations for the distortion of absmax INT and FP schemes in generic weight-plus-activation matrix multiplication.
Hierarchical vs. Flat Iteration in Shared-Weight Transformers
cs.CL 2026-04 unverdicted novelty 4.0

Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.
ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook
eess.SP 2026-04 unverdicted novelty 3.0

ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 22 Pith papers · 15 internal anchors

[1]

https://www.elastic.co/enterprise-search/vector-search

Elastic search., 2025. https://www.elastic.co/enterprise-search/vector-search

work page 2025
[2]

https://qdrant.tech/

Qdrant vectore search., 2025. https://qdrant.tech/

work page 2025
[3]

https://github.com/pgvector/pgvector/

Pgvector search., 2025. https://github.com/pgvector/pgvector/

work page 2025
[4]

https://www.pinecone.io/

Pinecone vectore database., 2025. https://www.pinecone.io/

work page 2025
[5]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Pro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp. 4895–4901, 2023

work page 2023
[7]

Claude, 2024

Anthropic. Claude, 2024. https://www.anthropic.com/news/claude-3-family

work page 2024
[8]

L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J

Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024

work page arXiv 2024
[9]

and Lempitsky, V

Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 931– 938, 2014

work page 2014
[10]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 21

work page internal anchor Pith review Pith/arXiv arXiv 2004
[12]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems , 36:4396–4429, 2023

work page 2023
[14]

Cover, T. M. Elements of information theory . John Wiley & Sons, 1999

work page 1999
[15]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix mul- tiplication for transformers at scale. Advances in Neural Information Processing Systems , 35: 30318–30332, 2022

work page 2022
[17]

Qaq: Quality adaptive quantization for llm kv cache

Dong, S., Cheng, W., Qin, J., and Wang, W. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643 , 2024

work page arXiv 2024
[18]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Data engineering for scaling language models to 128k context

Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171 , 2024. URL https://github.com/FranxYao/Long-Context-Data-Engineering

work page arXiv 2024
[22]

Gao, J., Gou, Y., Xu, Y., Yang, Y., Long, C., and Wong, R. C.-W. Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search. arXiv preprint arXiv:2409.09913 , 2024

work page arXiv 2024
[23]

Retrieval-Augmented Generation for Large Language Models: A Survey

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Optimized product quantization for approximate near- est neighbor search

Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate near- est neighbor search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2946–2953, 2013

work page 2013
[25]

Asymptotically optimal block quantization

Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on information theory, 25(4):373–380, 1979. 22

work page 1979
[26]

On the structure of vector quantizers

Gersho, A. On the structure of vector quantizers. IEEE Transactions on Information Theory , 28(2):157–166, 1982

work page 1982
[27]

Accelerat- ing large-scale inference with anisotropic vector quantization

Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerat- ing large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. PMLR, 2020

work page 2020
[28]

Polarquant: Quantizing kv caches with polar transformation

Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. Polarquant: Quantizing kv caches with polar transformation. arXiv preprint arXiv:2502.02617 , 2025

work page arXiv 2025
[29]

Balancekv: Kv cache compression through discrepancy theory

Han, I., Kapralov, M., Kochetkova, E., Sheth, K., and Zandieh, A. Balancekv: Kv cache compression through discrepancy theory. arXiv preprint arXiv:2502.07861 , 2025

work page arXiv 2025
[30]

KVQuant: Towards 10 million context length llm inference with kv cache quantization.arXiv preprint arXiv:2401.18079, 2024

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quanti- zation. arXiv preprint arXiv:2401.18079 , 2024

work page arXiv 2024
[31]

Product quantization for nearest neighbor search

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence , 33(1):117–128, 2010

work page 2010
[32]

Needle in a haystack - pressure testing llms., 2023

Kamradt, G. Needle in a haystack - pressure testing llms., 2023. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack

work page 2023
[33]

GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527, 2024

work page arXiv 2024
[34]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[35]

and Zaharia, M

Khattab, O. and Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval , pp. 39–48, 2020

work page 2020
[36]

Lexico: Extreme kv cache compression via sparse coding over universal dictionaries

Kim, J., Park, J., Cho, J., and Papailiopoulos, D. Lexico: Extreme kv cache compression via sparse coding over universal dictionaries. arXiv preprint arXiv:2412.08890 , 2024

work page arXiv 2024
[37]

W., and Keutzer, K

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629 , 2023

work page arXiv 2023
[38]

SnapKV: LLM Knows What You are Looking for Before Generation

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems , 6:87–100, 2024

work page 2024
[40]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems , 36, 2024. 23

work page 2024
[41]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Least squares quantization in pcm

Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory , 28(2): 129–137, 1982

work page 1982
[43]

Quantizing for minimum distortion

Max, J. Quantizing for minimum distortion. IRE Transactions on Information Theory , 6(1): 7–12, 1960

work page 1960
[44]

and Dite, W

Panter, P. and Dite, W. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, 1951

work page 1951
[45]

GloVe: Global vectors for word representation

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL ...

work page doi:10.3115/v1/d14-1162 2014
[46]

Colbertv2: Effective and efficient retrieval via lightweight late interaction

Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. Colbertv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488 , 2021

work page arXiv 2021
[47]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision,

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention- 3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024

work page arXiv 2024
[48]

Shannon, C. E. A mathematical theory of communication. The Bell system technical journal , 27(3):379–423, 1948

work page 1948
[49]

Shannon, C. E. et al. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec, 4(142-163):1, 1959

work page 1959
[50]

Fast Transformer Decoding: One Write-Head is All You Need

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[51]

RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations.arXiv preprint arXiv:2501.16383, 2025

Su, Z., Chen, Z., Shen, W., Wei, H., Li, L., Yu, H., and Yuan, K. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations, 2025. URL https://arxiv.org/abs/2501.16383

work page arXiv 2025
[52]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

Thakur, N., Reimers, N., R¨ uckl´ e, A., Srivastava, A., and Gurevych, I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ

work page 2021
[54]

N., Kaiser, L., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. NeurIPS, 2017. 24

work page 2017
[55]

High-dimensional probability: An introduction with applications in data science , volume 47

Vershynin, R. High-dimensional probability: An introduction with applications in data science , volume 47. Cambridge university press, 2018

work page 2018
[56]

T., et al

Wang, J., Zhang, T., Sebe, N., Shen, H. T., et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence , 40(4):769–790, 2017

work page 2017
[57]

Smoothquant: Accurate and efficient post-training quantization for large language models

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023

work page 2023
[58]

Efficient Streaming Language Models with Attention Sinks

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S

Yang, J. Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S. J., and Lee, D. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096 , 2024

work page arXiv 2024
[60]

WKVQuant: Quantizing weight and key/value cache for large language models gains more

Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., and Nie, L. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065 , 2024

work page arXiv 2024
[61]

Zador, P. L. Development and evaluation of procedures for quantizing multivariate distributions. Stanford University, 1964

work page 1964
[63]

Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead

Zandieh, A., Daliri, M., and Han, I. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. arXiv preprint arXiv:2406.03482 , 2024

work page arXiv 2024
[64]

Subgen: Token generation in sublinear time and memory

Zandieh, A., Han, I., Mirrokni, V., and Karbasi, A. Subgen: Token generation in sublinear time and memory. arXiv preprint arXiv:2402.06082 , 2024

work page arXiv 2024
[65]

Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

Zhang, T., Yi, J., Xu, Z., and Shrivastava, A. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917 , 2024

work page arXiv 2024
[66]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., R´ e, C., Barrett, C., et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems , 36, 2024. 25

work page 2024

[1] [1]

https://www.elastic.co/enterprise-search/vector-search

Elastic search., 2025. https://www.elastic.co/enterprise-search/vector-search

work page 2025

[2] [2]

https://qdrant.tech/

Qdrant vectore search., 2025. https://qdrant.tech/

work page 2025

[3] [3]

https://github.com/pgvector/pgvector/

Pgvector search., 2025. https://github.com/pgvector/pgvector/

work page 2025

[4] [4]

https://www.pinecone.io/

Pinecone vectore database., 2025. https://www.pinecone.io/

work page 2025

[5] [5]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Pro- ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pp. 4895–4901, 2023

work page 2023

[7] [7]

Claude, 2024

Anthropic. Claude, 2024. https://www.anthropic.com/news/claude-3-family

work page 2024

[8] [8]

L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J

Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024

work page arXiv 2024

[9] [9]

and Lempitsky, V

Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 931– 938, 2014

work page 2014

[10] [10]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 21

work page internal anchor Pith review Pith/arXiv arXiv 2004

[12] [12]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems , 36:4396–4429, 2023

work page 2023

[14] [14]

Cover, T. M. Elements of information theory . John Wiley & Sons, 1999

work page 1999

[15] [15]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix mul- tiplication for transformers at scale. Advances in Neural Information Processing Systems , 35: 30318–30332, 2022

work page 2022

[17] [17]

Qaq: Quality adaptive quantization for llm kv cache

Dong, S., Cheng, W., Qin, J., and Wang, W. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643 , 2024

work page arXiv 2024

[18] [18]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quanti- zation for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

Data engineering for scaling language models to 128k context

Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171 , 2024. URL https://github.com/FranxYao/Long-Context-Data-Engineering

work page arXiv 2024

[22] [22]

Gao, J., Gou, Y., Xu, Y., Yang, Y., Long, C., and Wong, R. C.-W. Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search. arXiv preprint arXiv:2409.09913 , 2024

work page arXiv 2024

[23] [23]

Retrieval-Augmented Generation for Large Language Models: A Survey

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Optimized product quantization for approximate near- est neighbor search

Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate near- est neighbor search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2946–2953, 2013

work page 2013

[25] [25]

Asymptotically optimal block quantization

Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on information theory, 25(4):373–380, 1979. 22

work page 1979

[26] [26]

On the structure of vector quantizers

Gersho, A. On the structure of vector quantizers. IEEE Transactions on Information Theory , 28(2):157–166, 1982

work page 1982

[27] [27]

Accelerat- ing large-scale inference with anisotropic vector quantization

Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerat- ing large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. PMLR, 2020

work page 2020

[28] [28]

Polarquant: Quantizing kv caches with polar transformation

Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. Polarquant: Quantizing kv caches with polar transformation. arXiv preprint arXiv:2502.02617 , 2025

work page arXiv 2025

[29] [29]

Balancekv: Kv cache compression through discrepancy theory

Han, I., Kapralov, M., Kochetkova, E., Sheth, K., and Zandieh, A. Balancekv: Kv cache compression through discrepancy theory. arXiv preprint arXiv:2502.07861 , 2025

work page arXiv 2025

[30] [30]

KVQuant: Towards 10 million context length llm inference with kv cache quantization.arXiv preprint arXiv:2401.18079, 2024

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quanti- zation. arXiv preprint arXiv:2401.18079 , 2024

work page arXiv 2024

[31] [31]

Product quantization for nearest neighbor search

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence , 33(1):117–128, 2010

work page 2010

[32] [32]

Needle in a haystack - pressure testing llms., 2023

Kamradt, G. Needle in a haystack - pressure testing llms., 2023. https://github.com/ gkamradt/LLMTest_NeedleInAHaystack

work page 2023

[33] [33]

GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527, 2024

work page arXiv 2024

[34] [34]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[35] [35]

and Zaharia, M

Khattab, O. and Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval , pp. 39–48, 2020

work page 2020

[36] [36]

Lexico: Extreme kv cache compression via sparse coding over universal dictionaries

Kim, J., Park, J., Cho, J., and Papailiopoulos, D. Lexico: Extreme kv cache compression via sparse coding over universal dictionaries. arXiv preprint arXiv:2412.08890 , 2024

work page arXiv 2024

[37] [37]

W., and Keutzer, K

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629 , 2023

work page arXiv 2023

[38] [38]

SnapKV: LLM Knows What You are Looking for Before Generation

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems , 6:87–100, 2024

work page 2024

[40] [40]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems , 36, 2024. 23

work page 2024

[41] [41]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

Least squares quantization in pcm

Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory , 28(2): 129–137, 1982

work page 1982

[43] [43]

Quantizing for minimum distortion

Max, J. Quantizing for minimum distortion. IRE Transactions on Information Theory , 6(1): 7–12, 1960

work page 1960

[44] [44]

and Dite, W

Panter, P. and Dite, W. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, 1951

work page 1951

[45] [45]

GloVe: Global vectors for word representation

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL ...

work page doi:10.3115/v1/d14-1162 2014

[46] [46]

Colbertv2: Effective and efficient retrieval via lightweight late interaction

Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. Colbertv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488 , 2021

work page arXiv 2021

[47] [47]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision,

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention- 3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024

work page arXiv 2024

[48] [48]

Shannon, C. E. A mathematical theory of communication. The Bell system technical journal , 27(3):379–423, 1948

work page 1948

[49] [49]

Shannon, C. E. et al. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec, 4(142-163):1, 1959

work page 1959

[50] [50]

Fast Transformer Decoding: One Write-Head is All You Need

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[51] [51]

RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations.arXiv preprint arXiv:2501.16383, 2025

Su, Z., Chen, Z., Shen, W., Wei, H., Li, L., Yu, H., and Yuan, K. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations, 2025. URL https://arxiv.org/abs/2501.16383

work page arXiv 2025

[52] [52]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

Thakur, N., Reimers, N., R¨ uckl´ e, A., Srivastava, A., and Gurevych, I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ

work page 2021

[54] [54]

N., Kaiser, L., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. NeurIPS, 2017. 24

work page 2017

[55] [55]

High-dimensional probability: An introduction with applications in data science , volume 47

Vershynin, R. High-dimensional probability: An introduction with applications in data science , volume 47. Cambridge university press, 2018

work page 2018

[56] [56]

T., et al

Wang, J., Zhang, T., Sebe, N., Shen, H. T., et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence , 40(4):769–790, 2017

work page 2017

[57] [57]

Smoothquant: Accurate and efficient post-training quantization for large language models

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023

work page 2023

[58] [58]

Efficient Streaming Language Models with Attention Sinks

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S

Yang, J. Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S. J., and Lee, D. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096 , 2024

work page arXiv 2024

[60] [60]

WKVQuant: Quantizing weight and key/value cache for large language models gains more

Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., and Nie, L. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065 , 2024

work page arXiv 2024

[61] [61]

Zador, P. L. Development and evaluation of procedures for quantizing multivariate distributions. Stanford University, 1964

work page 1964

[62] [63]

Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead

Zandieh, A., Daliri, M., and Han, I. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. arXiv preprint arXiv:2406.03482 , 2024

work page arXiv 2024

[63] [64]

Subgen: Token generation in sublinear time and memory

Zandieh, A., Han, I., Mirrokni, V., and Karbasi, A. Subgen: Token generation in sublinear time and memory. arXiv preprint arXiv:2402.06082 , 2024

work page arXiv 2024

[64] [65]

Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

Zhang, T., Yi, J., Xu, Z., and Shrivastava, A. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917 , 2024

work page arXiv 2024

[65] [66]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., R´ e, C., Barrett, C., et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems , 36, 2024. 25

work page 2024