
arxiv: 2605.21226 · v1 · pith:ZSS4WQHKnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

Pith reviewed 2026-05-21 05:23 UTC · model grok-4.3

keywords KV cache compression · transformer inference · quantization · octahedral parametrization · squared error optimization · rotation preconditioning · long context

The pith

OCTOPUS jointly quantizes rotated KV triplets via octahedral mapping to achieve optimal squared-error compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OCTOPUS as a compression scheme for the key-value cache that dominates memory use during long-context autoregressive inference. It applies a structured rotation to the vectors, then groups the rotated coordinates into triplets whose direction is mapped onto a square through an octahedral parameterization. The two resulting coordinates and the triplet norm are then quantized separately with Lloyd-Max scalar quantizers tuned to the marginal distributions induced by the rotation. Bit allocation across the triplet components is chosen to minimize the overall squared error for a given total bit budget, producing a non-uniform allocation that depends only on the ambient dimension. The resulting codec is data-oblivious, deterministic given a seed, and admits a fused kernel that reconstructs keys on the fly without extra memory traffic.
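The front end of this pipeline (rotation, then triplet split) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a power-of-two key dimension, L2 normalisation of the key, a random ±1 sign vector drawn from a seeded generator, and zero padding of the last triplet when d is not divisible by 3. The names `fwht`, `rotate_and_split`, and `merge_and_unrotate` are hypothetical.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis.

    Assumes the length is a power of two. H / sqrt(d) is symmetric and
    orthogonal, so applying fwht twice returns the input.
    """
    x = np.array(x, dtype=np.float64)
    d = x.shape[-1]
    assert d > 0 and d & (d - 1) == 0, "length must be a power of two"
    lead = x.shape[:-1]
    h = 1
    while h < d:
        # Butterfly: pair each block of h entries with the next block of h.
        y = x.reshape(*lead, d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2).reshape(*lead, d)
        h *= 2
    return x / np.sqrt(d)

def rotate_and_split(k, seed=0):
    """Normalise, rotate, and split one key into triplet norms and directions."""
    k = np.asarray(k, dtype=np.float64)
    d = k.shape[-1]
    scale = np.linalg.norm(k)                        # assumed: L2 normalisation
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=d)
    rotated = fwht(signs * k / scale)                # sign flip, then Hadamard
    n_tri = -(-d // 3)                               # ceil(d / 3)
    padded = np.zeros(3 * n_tri)
    padded[:d] = rotated                             # assumed: zero padding
    triplets = padded.reshape(n_tri, 3)
    rho = np.linalg.norm(triplets, axis=1)           # triplet norms
    directions = triplets / np.maximum(rho, 1e-12)[:, None]
    return scale, signs, rho, directions

def merge_and_unrotate(scale, signs, rho, directions):
    """Inverse of rotate_and_split when nothing has been quantized."""
    d = signs.shape[0]
    rotated = (rho[:, None] * directions).reshape(-1)[:d]
    return scale * signs * fwht(rotated)

k = np.random.default_rng(1).standard_normal(128)
scale, signs, rho, directions = rotate_and_split(k)
k_back = merge_and_unrotate(scale, signs, rho, directions)
```

Without quantization the round trip is exact up to floating-point error, which makes it a convenient check before any quantizer is plugged in between the two functions.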

Core claim

Joint quantization of each rotated coordinate triplet, after mapping its direction to a square by the octahedral parameterization and allocating bits to the two projected coordinates plus the norm so as to minimize squared error, yields a KV-cache codec that equals or exceeds every prior rotation-preconditioned scalar quantizer at every reported bit width and evaluation metric, with the margin widening as the average bit rate falls.
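One way to write the allocation problem this claim refers to is given below. The notation is an assumption made for illustration and is not taken from the paper: t = ρ n is a rotated triplet with norm ρ and unit direction n, (u, v) is the octahedral image of n, b_u, b_v, b_ρ are the bits given to each component, and the per-triplet budget is taken to be 3b for an average rate of b bits per coordinate.

```latex
\[
\min_{(b_u,\, b_v,\, b_\rho) \in \mathbb{Z}_{\ge 0}^{3}}
  \; \mathbb{E}\,\bigl\| t - \hat{t}(b_u, b_v, b_\rho) \bigr\|_2^2
\quad \text{subject to} \quad b_u + b_v + b_\rho = 3b,
\qquad
\hat{t} = \hat{\rho}\,\operatorname{oct}^{-1}(\hat{u}, \hat{v}).
\]
```

Here \hat{u}, \hat{v}, and \hat{\rho} denote the scalar-quantized components, and oct^{-1} is the inverse of the octahedral map. A non-uniform solution is any minimizer other than b_u = b_v = b_ρ = b.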

What carries the argument

Octahedral parameterization that maps the direction of a 3-D coordinate triplet onto a 2-D square, enabling joint quantization of the two resulting coordinates together with the triplet norm under squared-error-optimal bit allocation.
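The octahedral map itself is a standard construction from computer graphics (see the unit-vector survey by Cigolle et al. in the paper's references). A minimal NumPy sketch of that standard form follows; whether the paper's Eq. 5–6 match it exactly is an assumption, and the function names are hypothetical.

```python
import numpy as np

def _sign_nonzero(a):
    # Like np.sign, but maps 0 to +1 so the fold never zeroes a coordinate.
    return np.where(a >= 0.0, 1.0, -1.0)

def oct_encode(n):
    """Map unit vectors of shape (..., 3) to points in the square [-1, 1]^2."""
    n = np.asarray(n, dtype=np.float64)
    # Project onto the octahedron |x| + |y| + |z| = 1, then drop z.
    p = n[..., :2] / np.sum(np.abs(n), axis=-1, keepdims=True)
    # Lower hemisphere: fold the four outer triangles over the diagonals.
    folded = (1.0 - np.abs(p[..., ::-1])) * _sign_nonzero(p)
    return np.where(n[..., 2:3] < 0.0, folded, p)

def oct_decode(p):
    """Inverse map: points in [-1, 1]^2 back to unit vectors of shape (..., 3)."""
    p = np.asarray(p, dtype=np.float64)
    z = 1.0 - np.sum(np.abs(p), axis=-1, keepdims=True)
    # A negative z marks a point that was folded during encoding.
    folded = (1.0 - np.abs(p[..., ::-1])) * _sign_nonzero(p)
    xy = np.where(z < 0.0, folded, p)
    v = np.concatenate([xy, z], axis=-1)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Round trip on a single direction.
n = np.array([1.0, -2.0, -2.0]) / 3.0
n_back = oct_decode(oct_encode(n))
```

The map is a bijection between the sphere and the square (up to the fold seams), so any error in a reconstructed direction comes from quantizing the two square coordinates, not from the parameterization.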

If this is right

  • Lower average bit widths become usable for KV cache without proportional quality loss.
  • The same allocation rule applies across text, video, and audio decoders once the total dimension is known.
  • Fused on-the-fly reconstruction removes any added decode-time memory bandwidth.
  • The codec remains online and deterministic, requiring only a seed for reproducibility.
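To make the first point concrete, the arithmetic behind a KV-cache footprint can be sketched as follows. The model shape used here (28 layers, 4 KV heads, head dimension 128) is an illustrative placeholder and is not taken from the paper; the 32,768-token context matches the Figure 4 caption. The estimate ignores side information such as stored norms or scales, so real codecs will sit somewhat above it.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Approximate KV-cache size in bytes, ignoring per-vector side information."""
    values = 2 * tokens * layers * kv_heads * head_dim  # keys plus values
    return values * bits_per_value / 8

# Hypothetical decoder shape, chosen only for illustration.
shape = dict(tokens=32768, layers=28, kv_heads=4, head_dim=128)

fp16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
two_bit_bytes = kv_cache_bytes(bits_per_value=2, **shape)
print(f"fp16: {fp16_bytes / 2**30:.2f} GiB, 2-bit: {two_bit_bytes / 2**30:.2f} GiB")
```

Because the footprint is linear in both token count and bits per value, halving the bit width halves the cache at any context length.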

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The constant finite-dimensional optimum may indicate that high-dimensional rotation makes the marginals sufficiently universal that downstream task loss tracks squared error closely.
  • Similar octahedral grouping could be tested on other high-dimensional activations whose marginals are approximately isotropic after rotation.
  • Hardware kernels could exploit the fixed triplet structure to further reduce register pressure during dequantization.

Load-bearing premise

The squared-error bit allocation derived from the octahedral triplet mapping stays near-optimal for actual downstream quality metrics on real decoders.

What would settle it

A new model or task where exhaustive search over per-triplet bit allocations produces a different optimum than the constant allocation found by sweeps, or where the reported quality lead vanishes at low bit widths.

Figures

Figures reproduced from arXiv: 2605.21226 by Mark Boss, Shimon Vainer, Simon Donné, Vikram Voleti.

Figure 1: The OCTOPUS encode pipeline. Stages 1–5 (top) realise the rotation and triplet decomposition of Sec. 3.1–3.2: a key k is normalised (Eq. 1), preconditioned by a sign-flipped Walsh-Hadamard rotation (Eq. 2), cut into n_tri = ⌈d/3⌉ triplets, and decomposed into a triplet norm ρ_i and a unit direction n_i ∈ S^2 (Sec. 3.2). Stage 6 (middle) maps each direction onto [−1, 1]^2 via the octahedral fold (Eq. 5–6); the…
Figure 2: Synthetic fidelity. (a) OCTOPUS-QJL is best at every bit width; OCTOPUS alone beats every non-QJL baseline. (b) OCTOPUS-QJL tracks fp32 to within 0.001; TurboQuant-QJL drops to near-uniform at b=2.
Figure 3: Qwen2.5-7B rate-quality and needle recall. OCTOPUS does not collapse at b=2 on either PPL or retrieval; at b=4 all codecs retain baseline recall. under compressed KV. The cache recipe matches the video sweep except for the autoregressive unit and group size: residual window one native-precision scale, V group g=16, and no per-layer protection. We report LSD, log-mel MSE, SNR, and latent cosine against the …
Figure 4: LLM quality at fixed deployment memory. WikiText-2 perplexity vs. KV-cache memory at 32,768-token context for Qwen2.5-7B-Instruct-1M. The probe-time kv_cache_bytes is linearly extrapolated from the sweep’s measurement window to 32k tokens, so the x-axis is the memory budget a deployment actually pays. OCTOPUS dominates the Pareto frontier across the full memory range; OCTOPUS-QJL trails slightly because t…
Figure 5: Worst-case codec divergence across bit depths. Each panel shows the single frame with the highest combined cross-codec L1 divergence from the fp16 baseline (same frame index for both pipelines). Rows: b=4, 3, 2; columns: baseline and each codec. OCTOPUS remains visually faithful at every bit width; competing codecs collapse at b≤3.
Original abstract

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: https://octopus-quant.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OCTOPUS, a KV-cache compression codec for autoregressive transformers that extends rotation-preconditioned scalar quantization by jointly quantizing rotated coordinate triplets via an octahedral parameterization. Each triplet is mapped to a square, after which the two in-plane coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. The per-triplet squared-error optimum yields a strictly non-uniform bit allocation that depends only on total key dimensionality. Sweeps are reported to locate a constant finite-dimensional quality optimum across every real decoder tested. The codec is claimed to match or exceed prior rotation codecs (TurboQuant, PolarQuant) at every reported bit width and metric, with the advantage increasing at low bit widths; a fused Triton kernel reconstructs keys on the fly without materializing the uncompressed tensor.

Significance. If the reported constancy of the finite-dimensional optimum and the superiority at low bit widths are confirmed, the work would supply a data-oblivious, dimensionality-only recipe for KV compression that requires no per-model retuning and improves upon existing rotation codecs, especially under extreme compression. The combination of an analytically derived bit allocation, cross-modality empirical results, and a zero-overhead fused kernel would constitute a practical advance for long-context inference.

major comments (2)
  1. §5.1 (Experimental validation of constancy): The claim that sweeps locate a constant finite-dimensional quality optimum on every tested decoder is load-bearing for the general-applicability statement. The manuscript should either supply a structural argument showing why the optimum cannot shift under different attention marginals or architectures, or report additional sweeps on at least two untested decoder families (e.g., a non-standard attention variant or a multimodal model outside the text/video/audio set).
  2. §3.3 (Bit-allocation derivation): The non-uniform allocation is obtained by minimizing squared error on the octahedral triplet mapping. It is unclear whether the resulting allocation remains near-optimal once the actual per-coordinate marginals of a real decoder deviate from the assumed implementation-matched distributions; an ablation that replaces the derived allocation with a uniform one on the same rotated vectors would quantify the contribution of the optimization.
minor comments (2)
  1. Abstract: the phrase 'every real decoder we test' is used without stating the number or architectural diversity of the decoders; a parenthetical listing the exact models would improve precision.
  2. §6 (Implementation): the fused Triton kernel is stated to add no decode-time bandwidth, yet no operation-count or memory-access comparison against a standard dequantization baseline is supplied; a short table would clarify the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: §5.1 (Experimental validation of constancy): The claim that sweeps locate a constant finite-dimensional quality optimum on every tested decoder is load-bearing for the general-applicability statement. The manuscript should either supply a structural argument showing why the optimum cannot shift under different attention marginals or architectures, or report additional sweeps on at least two untested decoder families (e.g., a non-standard attention variant or a multimodal model outside the text/video/audio set).

    Authors: We agree that demonstrating the stability of the finite-dimensional optimum is central to the claim of general applicability. While we lack a fully rigorous invariance theorem, a structural argument follows from the construction: the octahedral mapping and squared-error minimization operate on triplets after a fixed random rotation that equalizes coordinate statistics, and the resulting bit allocation depends only on total key dimension rather than on the specific pre-rotation marginals. Because the rotation is data-oblivious and the Lloyd-Max quantizers are matched to the post-rotation implementation distributions, the per-triplet optimum is expected to remain stable for any architecture that employs comparable rotary or equivalent preconditioning. In the revision we will insert this argument, together with a short derivation sketch, into §5.1.
    Revision: yes

  2. Referee: §3.3 (Bit-allocation derivation): The non-uniform allocation is obtained by minimizing squared error on the octahedral triplet mapping. It is unclear whether the resulting allocation remains near-optimal once the actual per-coordinate marginals of a real decoder deviate from the assumed implementation-matched distributions; an ablation that replaces the derived allocation with a uniform one on the same rotated vectors would quantify the contribution of the optimization.

    Authors: The referee correctly identifies a point that merits explicit quantification. We will add an ablation that applies both the derived non-uniform allocation and a uniform allocation to identical sets of rotated triplets, keeping all other components of the codec fixed. The comparison will be reported in §3.3 (or a new short subsection) using the same models and bit-widths as the main experiments. This will directly measure the contribution of the squared-error-optimal allocation and confirm that the gain is largest at the lowest bit widths, as predicted by the derivation.
    Revision: yes
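An ablation of the kind discussed in the second response could be prototyped along the following lines. This is a toy harness under stated assumptions, not the authors' experiment: Gaussian triplets stand in for rotated keys, uniform scalar quantizers stand in for Lloyd-Max, the octahedral map is the standard graphics form, and all function names are hypothetical. It enumerates every split of a per-triplet bit budget and records the squared error of each, so the uniform split can be compared with the best one found; it makes no claim about which split wins on real decoders.

```python
import itertools
import numpy as np

def _sign_nonzero(a):
    return np.where(a >= 0.0, 1.0, -1.0)

def oct_encode(n):
    p = n[..., :2] / np.sum(np.abs(n), axis=-1, keepdims=True)
    folded = (1.0 - np.abs(p[..., ::-1])) * _sign_nonzero(p)
    return np.where(n[..., 2:3] < 0.0, folded, p)

def oct_decode(p):
    z = 1.0 - np.sum(np.abs(p), axis=-1, keepdims=True)
    folded = (1.0 - np.abs(p[..., ::-1])) * _sign_nonzero(p)
    v = np.concatenate([np.where(z < 0.0, folded, p), z], axis=-1)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def uniform_q(x, bits, lo, hi):
    """Uniform quantizer with 2**bits cells on [lo, hi], reconstructing at cell centres."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    idx = np.clip(np.floor((x - lo) / step), 0, levels - 1)
    return lo + (idx + 0.5) * step

def triplet_mse(t, bits_u, bits_v, bits_rho, rho_max):
    """Mean squared error per triplet for one (u, v, rho) bit split."""
    rho = np.linalg.norm(t, axis=-1, keepdims=True)
    p = oct_encode(t / rho)
    p_hat = np.stack([uniform_q(p[..., 0], bits_u, -1.0, 1.0),
                      uniform_q(p[..., 1], bits_v, -1.0, 1.0)], axis=-1)
    rho_hat = uniform_q(rho, bits_rho, 0.0, rho_max)
    t_hat = rho_hat * oct_decode(p_hat)
    return float(np.mean(np.sum((t - t_hat) ** 2, axis=-1)))

def sweep(t, budget):
    """Evaluate every non-negative integer split of `budget` bits per triplet."""
    rho_max = float(np.linalg.norm(t, axis=-1).max())
    results = {}
    for bits_u, bits_v in itertools.product(range(budget + 1), repeat=2):
        bits_rho = budget - bits_u - bits_v
        if bits_rho >= 0:
            results[(bits_u, bits_v, bits_rho)] = triplet_mse(
                t, bits_u, bits_v, bits_rho, rho_max)
    return results

# Stand-in data: isotropic Gaussian triplets (an assumption, not real keys).
triplets = np.random.default_rng(0).standard_normal((20000, 3))
results = sweep(triplets, budget=9)          # 3 bits per coordinate on average
best = min(results, key=results.get)
print("best split (u, v, rho):", best, "mse:", results[best])
print("uniform split (3, 3, 3) mse:", results[(3, 3, 3)])
```

Swapping the synthetic triplets for rotated keys captured from a real decoder, and the uniform quantizers for marginal-matched ones, would turn this harness into the comparison the referee asks for.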

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The bit allocation is obtained by direct optimization of per-triplet squared error under the octahedral mapping, yielding an expression that depends only on total dimensionality as an exogenous input; this is a forward derivation rather than a fit renamed as prediction. The reported constancy of the finite-dimensional optimum is an empirical observation from decoder sweeps, not a definitional or self-referential step. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and the new octahedral parameterization plus Lloyd-Max quantization against marginals are introduced independently of the target performance claims. The overall codec performance is validated against external baselines rather than reducing to the paper's own fitted quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the octahedral mapping being a suitable low-distortion direction quantizer for rotated triplets and on the squared-error minimization producing a bit allocation that is near-optimal for the actual marginals encountered in real decoders.

axioms (2)
  • Domain assumption: Marginals after rotation are analytically tractable and well-matched by Lloyd-Max quantizers.
    Invoked to justify per-component Lloyd-Max quantization of the two square coordinates and the triplet norm.
  • Domain assumption: The finite-dimensional quality optimum found by sweeps is constant across decoders.
    Stated as an empirical finding that underpins the claim of a single, general bit-allocation rule.
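The first assumption refers to Lloyd-Max scalar quantization, which alternates between placing cell boundaries midway between reconstruction levels and moving each level to the mean of its cell. A minimal sample-based sketch follows. It uses standard Gaussian samples as a stand-in marginal, which is an assumption; the paper fits its quantizers to implementation-matched marginals that are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def lloyd_max(samples, n_levels, iters=50):
    """Fit reconstruction levels to 1-D samples with Lloyd's algorithm."""
    samples = np.asarray(samples, dtype=np.float64)
    # Start from evenly spaced quantiles so every cell is populated.
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = 0.5 * (levels[:-1] + levels[1:])       # nearest-level boundaries
        idx = np.searchsorted(edges, samples)          # cell index per sample
        sums = np.bincount(idx, weights=samples, minlength=n_levels)
        counts = np.bincount(idx, minlength=n_levels)
        # Centroid update; an empty cell keeps its previous level.
        levels = np.where(counts > 0, sums / np.maximum(counts, 1), levels)
    return levels

def quantize(x, levels):
    """Return (indices, reconstructions) for values x under the fitted levels."""
    edges = 0.5 * (levels[:-1] + levels[1:])
    idx = np.searchsorted(edges, np.asarray(x, dtype=np.float64))
    return idx, levels[idx]

# Stand-in marginal: standard Gaussian samples (an assumption).
x = np.random.default_rng(0).standard_normal(200_000)
levels_2bit = lloyd_max(x, n_levels=4)
idx, x_hat = quantize(x, levels_2bit)
mse_2bit = float(np.mean((x - x_hat) ** 2))
```

Only the indices need to be stored per value; the level table is shared, which is what lets a quantizer of this kind stay data-oblivious once the marginal is fixed.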

pith-pipeline@v0.9.0 · 5772 in / 1418 out tokens · 46609 ms · 2026-05-21T05:23:45.040700+00:00 · methodology


