RoPE-Aware Bit Allocation for KV-Cache Quantization

Fengfeng Liang; Jiaya Jia; Yuechen Zhang

arxiv: 2606.24033 · v1 · pith:KYRLJIA4new · submitted 2026-06-23 · 💻 cs.LG · cs.CL

RoPE-Aware Bit Allocation for KV-Cache Quantization

Fengfeng Liang , Yuechen Zhang , Jiaya Jia This is my paper

Pith reviewed 2026-06-26 01:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords KV-cache quantizationRoPEbit allocationlong-context inferenceattention logitsmemory compressionTurboQuant-MSE

0 comments

The pith

RoPE-aware bit allocation preserves attention logits better than uniform quantization under the same KV-cache bit budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing KV-cache quantizers treat keys as flat vectors, but RoPE decomposes each key's contribution to attention logits into position-dependent sums over two-dimensional frequency blocks. Block-GTQ therefore treats quantization as a block-wise bit-allocation problem and assigns more bits to high-energy RoPE blocks using a label-free energy score and greedy marginal-gain allocation. This requires no task labels or retraining. The method reduces per-layer mean absolute error on query-key logits by 32-80% at 2-3 bits per dimension and wins every layer comparison against uniform quantization. These logit-level gains produce substantial improvements on long-context retrieval, understanding, and reasoning tasks.

Core claim

Block-GTQ is a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE. For each layer and KV head it computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets Block-GTQ better preserves RoPE query-key logits, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context performance.

What carries the argument

Label-free energy score per RoPE frequency block driving greedy marginal-gain bit allocation.

If this is right

Better preserves RoPE query-key logits with lower MAE on all tested layers and models.
Raises NIAH average from 70.6 to 97.4 at K2V2 on Llama-3.1-8B-Instruct.
Achieves close to fp16 scores on AIME 2024/2025 with K3V2 where uniform quantization drops to zero.
Enables 3.24x KV-cache compression with packed serving while running 1.34x faster than fp16 at 128K context.
Supports 256K and 512K contexts where fp16 runs out of memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The energy-score proxy could be tested on other frequency-based position embeddings to check generality.
Integration with other quantization or compression methods might yield additive gains in memory efficiency.
The approach may reduce the need for fp16 recent-key buffers in long-context inference.
Deployment on edge devices could benefit from the reduced memory footprint at extended contexts.

Load-bearing premise

That the label-free energy score per RoPE block is a reliable proxy for the block's sensitivity to quantization error so that greedy allocation produces near-optimal bit widths.

What would settle it

An experiment that independently quantizes each RoPE block to different bit widths and measures the resulting change in attention logit error; if the error reduction does not track the energy scores the allocation strategy loses its justification.

Figures

Figures reproduced from arXiv: 2606.24033 by Fengfeng Liang, Jiaya Jia, Yuechen Zhang.

**Figure 1.** Figure 1: RoPE-block allocation. (a) The RoPE attention logit q ⊤R∆k decomposes into a sum over two-dimensional frequency blocks. (b) Per-block energy scores for one Qwen3-8B KV head (layer 10, head 4); scores are median-normalized for display and span orders of magnitude. (c) Under the same average bit width ¯b = 3 (dashed line), Block-GTQ reallocates bits from low-energy blocks to high-energy blocks instead of usi… view at source ↗

**Figure 2.** Figure 2: Packed-cache serving path. Persistent HBM stores packed K/V code streams plus norms and metadata. The fused attention kernel decodes only the current tile into kernel-local temporaries and consumes them directly in QK and PV products, avoiding a resident decoded fp16 KV cache. long contexts are split along the key axis and recombined with an exact log-sum-exp merge. The dequantized K/V stay in registers as… view at source ↗

**Figure 3.** Figure 3: Allocator fingerprint at the 3 b/dim budget. One subplot per model; each vertical slice is one layer, with stacked color bands giving its bit-width distribution (1b red → 8b green). 2-bit 3-bit 4-bit 10 1 Softmax KL (log lower is better) (a) No-buffer Softmax-KL across models Method Block-GTQ KIVI-ScaleOnly TQ-MSE 10 2 10 1 10 0 Softmax KL (per-model, log lower is better) 40 50 60 70 80 90 Top-10 attention… view at source ↗

**Figure 4.** Figure 4: Mean softmax KL and Top-10 attended-token overlap across models. K-only quantization (V stays fp16). (a) Mean softmax KL versus fp16 per method, averaged over the ten-model panel at 2, 3, and 4 b/dim budgets. (b) Top-10 attended-token overlap versus softmax KL, one marker per model (color: method, shape: bit-rate); the upper-left corner is best. See Appendix B.4 for the KIVI no-buffer setting. 6.2 Calibra… view at source ↗

**Figure 5.** Figure 5: NIAH single-needle retrieval on Llama-3.1-8B-Instruct. Pass rate is shown over context length (4K–128K) and needle depth (0%–100%), averaged over three trials per cell. The two rows use the same method layout: fp16, KIVI-ScaleOnly (Appendix B.4), TQ-MSE, and Block-GTQ. NIAH. NIAH [15] probes where retrieval breaks across context length and needle depth. On Llama-3.1-8B-Instruct, [PITH_FULL_IMAGE:figures/f… view at source ↗

**Figure 6.** Figure 6: Kernel-optimization gains, decode latency, and peak memory for Block-GTQ on Qwen2.5- [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Allocator fingerprint at the 2 b/dim budget. Per-layer bit-width distribution for each model. The numeric counterpart is [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Per-layer Shannon entropy H(ℓ) of the bit-width histogram at 3 b/dim across normalized layer depth (0 = first layer, 1 = last layer). Each curve is one model; per-model means and extrema are listed in [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: NIAH single-needle retrieval on Qwen2.5-7B-Instruct. Pass rate is shown over context length (4K–128K) and needle depth (0%–100%), averaged over three trials per cell. The two rows use the same method layout as [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

read the original abstract

Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE(TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4, and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K where fp16 OOMs. Code is available at https://github.com/JIA-Lab-research/blockgtq.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Block-GTQ delivers measurable logit and task gains via RoPE-block bit allocation, but the energy score's link to actual sensitivity is not directly tested.

read the letter

Block-GTQ treats KV-cache quantization as a per-RoPE-frequency-block bit allocation problem. It computes a label-free energy score per block, then greedily assigns integer bits by marginal gain on top of TQ-MSE. Under matched budgets this cuts per-layer MAE on query-key logits by 32-80% at 2-3 bits and wins every one of the 367 layer comparisons against uniform TQ-MSE. The fidelity lift shows up in stronger NIAH, LongBench, and AIME numbers, and the packed implementation gives 3.24x compression with speed and memory wins at 128K-512K context.

The concrete advance is the explicit tie to RoPE's two-dimensional frequency blocks instead of flat-vector quantization. The empirical panel across ten models and the end-to-end serving numbers are useful for anyone shipping long-context inference.

The soft spot is the missing diagnostic on the energy score itself. The method assumes the score ranks blocks by sensitivity to quantization error, yet the abstract reports no direct check that higher-energy blocks actually suffer larger logit MAE when quantized. Without that correlation plot or ranking test, the observed gains could come from any non-uniform allocation rather than the RoPE-specific choice. If the full paper contains this verification it would tighten the central claim; if not, the justification stays indirect.

The work is aimed at practitioners who optimize LLM serving under memory constraints. A reader already working on KV quantization or long-context deployment will find the numbers and the open code directly usable. The results are grounded enough and the practical payoff is clear enough that the paper deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes Block-GTQ, a RoPE-aware bit allocator for KV-cache quantization built on TurboQuant-MSE. For each layer and KV head it computes a label-free energy score per RoPE frequency block and performs greedy marginal-gain integer bit allocation. Under matched K/V budgets the method is reported to reduce per-layer MAE on preserved query-key logits by 32-80% at 2-3 b/dim K-only quantization, win all 367/367 layer comparisons against uniform TQ-MSE on a ten-model panel, and translate those fidelity gains into large improvements on long-context retrieval (NIAH), understanding (LongBench), and reasoning (AIME) tasks, plus practical serving speedups and memory reductions at 128K-512K contexts.

Significance. If the energy-score proxy is validated, the work supplies a concrete, label-free mechanism for non-uniform KV quantization that respects the block-wise structure induced by RoPE. The scale of the diagnostic panel (ten models, 367 layers) and the downstream task gains (e.g., NIAH 70.6→97.4 at K2V2 on Llama-3.1-8B) would be practically relevant for long-context serving; the public code link is a clear strength.

major comments (2)

[§3, §4.1] §3 (method) and §4.1 (diagnostics): the headline claim that Block-GTQ’s gains are due to RoPE-aware allocation rather than merely non-uniform bits rests on the energy score being a monotonic proxy for each block’s contribution to query-key logit error. No diagnostic is reported that ranks blocks by the energy score and then measures the resulting per-block MAE under quantization; without this correlation check the 32-80% MAE reduction and 367/367 win rate could be explained by any non-uniform allocator.
[Table 2, Figure 4] Table 2 / Figure 4 (downstream results): the reported NIAH and LongBench gains at K2V2 and K3V2 are large, but the paper does not state whether the uniform TQ-MSE baseline also received the same packed-cache serving path and recent-key buffer policy; any mismatch would inflate the apparent benefit of the allocator itself.

minor comments (2)

[§3.2] Notation for the energy score E_b and the marginal-gain function should be introduced with a single equation block rather than scattered across prose.
[§4] The ten-model diagnostic panel is described only by name; a table listing model sizes, layer counts, and head dimensions would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [§3, §4.1] §3 (method) and §4.1 (diagnostics): the headline claim that Block-GTQ’s gains are due to RoPE-aware allocation rather than merely non-uniform bits rests on the energy score being a monotonic proxy for each block’s contribution to query-key logit error. No diagnostic is reported that ranks blocks by the energy score and then measures the resulting per-block MAE under quantization; without this correlation check the 32-80% MAE reduction and 367/367 win rate could be explained by any non-uniform allocator.

Authors: We agree that an explicit per-block correlation between the energy score and quantization-induced MAE would provide stronger validation of the RoPE-aware mechanism. While the energy score is derived directly from the RoPE block decomposition of the query-key inner product and the greedy allocator is designed to allocate bits proportionally to these scores, the current diagnostics focus on aggregate MAE and layer-wise win rates. In the revised manuscript we will add a new diagnostic (new figure in §4.1) that ranks RoPE blocks by energy score within each head/layer and reports the corresponding per-block MAE under both uniform TQ-MSE and Block-GTQ quantization, thereby confirming the monotonic relationship. revision: yes
Referee: [Table 2, Figure 4] Table 2 / Figure 4 (downstream results): the reported NIAH and LongBench gains at K2V2 and K3V2 are large, but the paper does not state whether the uniform TQ-MSE baseline also received the same packed-cache serving path and recent-key buffer policy; any mismatch would inflate the apparent benefit of the allocator itself.

Authors: All reported downstream results, including those for the uniform TQ-MSE baseline, were obtained using the identical packed-cache serving path and recent-key buffer policy described in §5.3. We will insert an explicit clarifying sentence in the revised §5.2 and §5.3 to state this equivalence and thereby remove any ambiguity. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper proposes Block-GTQ as a new RoPE-aware bit allocator that computes a label-free energy score per RoPE block and uses greedy marginal-gain allocation. This is evaluated empirically against the uniform TQ-MSE baseline on logit MAE and downstream tasks across multiple models. The energy score is not derived from or fitted to the evaluation metrics, and the performance gains are measured externally rather than being tautological. No self-definitional loops, fitted predictions, or load-bearing self-citations reduce the claims to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that RoPE decomposes key contributions into independent 2D frequency blocks whose energy predicts quantization sensitivity. No free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Under RoPE, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks.
This decomposition is the stated reason that key-cache quantization becomes a block-wise bit-allocation problem.

invented entities (1)

Block-GTQ no independent evidence
purpose: RoPE-aware greedy bit allocator for key-cache quantization
New method introduced by the paper; no independent evidence outside the reported experiments is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5968 in / 1517 out tokens · 23044 ms · 2026-06-26T01:03:15.022789+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 4 canonical work pages

[1]

LongBench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

2024
[2]

Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. InConference on Language Modeling (COLM),
[3]

Abdelfattah, and Kai-Chiang Wu

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu. Palu: Kv-cache compression with low-rank projection. InInternational Conference on Learning Representations (ICLR), 2025

2025
[4]

Lon- glora: Efficient fine-tuning of long-context large language models

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Lon- glora: Efficient fine-tuning of long-context large language models. InInternational Conference on Learning Representations (ICLR), 2024

2024
[5]

MagicPIG: Lsh sampling for efficient llm generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. MagicPIG: Lsh sampling for efficient llm generation. InInternational Conference on Learning Representations (ICLR), 2025

2025
[6]

Abdelfattah, and Diana Marculescu

Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, and Diana Marculescu. UniQL: Unified quantization and low-rank compression for adaptive edge llms. InInternational Conference on Learning Representations (ICLR), 2026. 11

2026
[7]

FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

Pith/arXiv arXiv 2023
[8]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022
[9]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[10]

Kevin Zhou

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2407.11550

Pith/arXiv arXiv 2025
[11]

Model tells you what to discard: Adaptive KV cache compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. InInternational Conference on Learning Representations (ICLR), 2024

2024
[12]

Polarquant: Quantizing kv caches with polar transformation

Insu Han, Praneeth Kacham, Vahab Mirrokni, Amir Zandieh, and Amin Karbasi. Polarquant: Quantizing kv caches with polar transformation. InInternational Conference on Machine Learning (ICML), 2025

2025
[13]

Zipcache: Accurate and efficient kv cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[14]

Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[15]

RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

Pith/arXiv arXiv 2024
[16]

GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

arXiv 2024
[17]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

2023
[18]

CommVQ: Commutative vector quantization for KV cache compression

Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, and Chuang Gan. CommVQ: Commutative vector quantization for KV cache compression. InPro- ceedings of the 42nd International Conference on Machine Learning, volume 267 ofPro- ceedings of Machine Learning Resear...

2025
[19]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[20]

Mini- Cache: KV cache compression in depth dimension for large language models.arXiv preprint arXiv:2405.14366, 2024

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Mini- Cache: KV cache compression in depth dimension for large language models.arXiv preprint arXiv:2405.14366, 2024

arXiv 2024
[21]

PM-KVQ: Progressive mixed-precision kv cache quantization for long-cot llms

Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. PM-KVQ: Progressive mixed-precision kv cache quantization for long-cot llms. InInternational Conference on Learning Representations (ICLR), 2026. arXiv:2505.18610; code:https://github.com/thu-nics/PM-KVQ. 12

arXiv 2026
[22]

Mellette, Alex Forencich, Rukshani Athapathu, Alex C

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. CacheGen: KV cache compression and streaming for fast language model serving. InProceedings of the ACM SIGCOMM 2024 Conference, 2024. doi: 10.1145/3651890.3672274

work page doi:10.1145/3651890.3672274 2024
[23]

Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time.arXiv preprint arXiv:2305.17118, 2023

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time.arXiv preprint arXiv:2305.17118, 2023

arXiv 2023
[24]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InForty-first International Conference on Machine Learning (ICML), 2024

2024
[25]

TriAttention: Efficient long reasoning with trigonometric KV compression.arXiv preprint arXiv:2604.04921, 2026

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. TriAttention: Efficient long reasoning with trigonometric KV compression.arXiv preprint arXiv:2604.04921, 2026

Pith/arXiv arXiv 2026
[26]

Yarn: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024
[27]

Efficiently scaling transformer inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022

arXiv 2022
[28]

Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y . Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU.arXiv preprint arXiv:2303.06865, 2023

arXiv 2023
[29]

Cache me if you must: Adaptive key-value quantization for large language models

Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, and Dan Alistarh. Cache me if you must: Adaptive key-value quantization for large language models. InForty-second International Conference on Machine Learning (ICML), 2025

2025
[30]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[31]

Accurate kv cache quantization with outlier tokens trac- ing

Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhe- feng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens trac- ing. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, Vienna, Austria, 2025. Associ- ation for Computati...

work page doi:10.18653/v1/2025.acl-long.631 2025
[32]

KVSink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms

Zunhai Su and Kehong Yuan. KVSink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms. InConference on Language Modeling (COLM), 2025

2025
[33]

RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations

Zunhai Su, Hanyu Wei, Zhe Chen, Wang Shen, Linge Li, Huangqi Yu, and Kehong Yuan. RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), pages 6200–6208, 2025. doi: 10.24963/ijcai.2025/690. URL https://www...

work page doi:10.24963/ijcai.2025/690 2025
[34]

MoQAE: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts

Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. MoQAE: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10810–10820, Vienna, Austria,
[35]

doi: 10.18653/v1/2025.acl-long.531

Association for Computational Linguistics. doi: 10.18653/v1/2025.acl-long.531. URL https://aclanthology.org/2025.acl-long.531/. 13

work page doi:10.18653/v1/2025.acl-long.531 2025
[36]

SQuat: Subspace-orthogonal kv cache quantization.arXiv preprint arXiv:2503.24358, 2025

Hao Wang, Ligong Han, Kai Xu, and Akash Srivastava. SQuat: Subspace-orthogonal kv cache quantization.arXiv preprint arXiv:2503.24358, 2025

arXiv 2025
[37]

Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643,

Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athi- waratkun, Zhen Zheng, and Shuaiwen Leon Song. Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643,

arXiv
[38]

Code:https://github.com/Summer-Summer/Kitty
[39]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024
[40]

RAP: KV-cache compres- sion via RoPE-aligned pruning.arXiv preprint arXiv:2602.02599, 2026

Jihao Xin, Tian Lyu, David Keyes, Hatem Ltaief, and Marco Canini. RAP: KV-cache compres- sion via RoPE-aligned pruning.arXiv preprint arXiv:2602.02599, 2026

arXiv 2026
[41]

No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization.arXiv preprint arXiv:2402.18096, 2024

June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, and Dongsoo Lee. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization.arXiv preprint arXiv:2402.18096, 2024

arXiv 2024
[42]

SubGen: Token generation in sublinear time and memory.arXiv preprint arXiv:2402.06082, 2024

Amir Zandieh, Insu Han, Vahab Mirrokni, and Amin Karbasi. SubGen: Token generation in sublinear time and memory.arXiv preprint arXiv:2402.06082, 2024

arXiv 2024
[43]

Turboquant: Online vec- tor quantization with near-optimal distortion rate

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vec- tor quantization with near-optimal distortion rate. InInternational Conference on Learning Representations (ICLR), 2026

2026
[44]

MixKVQ: Query- aware mixed-precision kv cache quantization for long-context reasoning.arXiv preprint arXiv:2512.19206, 2025

Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, and Cen Chen. MixKVQ: Query- aware mixed-precision kv cache quantization for long-context reasoning.arXiv preprint arXiv:2512.19206, 2025

arXiv 2025
[45]

Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[46]

H2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[47]

Geometry

Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, and Xuanjing Huang. EliteKV: Scalable KV cache compression via RoPE frequency selection and joint low-rank projection.arXiv preprint arXiv:2503.01586, 2025. Appendix Roadmap Appendix A collects proofs for Block-GTQ (error bound, block weight, greedy optimality)....

arXiv 2025

[1] [1]

LongBench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

2024

[2] [2]

Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. InConference on Language Modeling (COLM),

[3] [3]

Abdelfattah, and Kai-Chiang Wu

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu. Palu: Kv-cache compression with low-rank projection. InInternational Conference on Learning Representations (ICLR), 2025

2025

[4] [4]

Lon- glora: Efficient fine-tuning of long-context large language models

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Lon- glora: Efficient fine-tuning of long-context large language models. InInternational Conference on Learning Representations (ICLR), 2024

2024

[5] [5]

MagicPIG: Lsh sampling for efficient llm generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. MagicPIG: Lsh sampling for efficient llm generation. InInternational Conference on Learning Representations (ICLR), 2025

2025

[6] [6]

Abdelfattah, and Diana Marculescu

Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, and Diana Marculescu. UniQL: Unified quantization and low-rank compression for adaptive edge llms. InInternational Conference on Learning Representations (ICLR), 2026. 11

2026

[7] [7]

FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

Pith/arXiv arXiv 2023

[8] [8]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022

[9] [9]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[10] [10]

Kevin Zhou

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2407.11550

Pith/arXiv arXiv 2025

[11] [11]

Model tells you what to discard: Adaptive KV cache compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. InInternational Conference on Learning Representations (ICLR), 2024

2024

[12] [12]

Polarquant: Quantizing kv caches with polar transformation

Insu Han, Praneeth Kacham, Vahab Mirrokni, Amir Zandieh, and Amin Karbasi. Polarquant: Quantizing kv caches with polar transformation. InInternational Conference on Machine Learning (ICML), 2025

2025

[13] [13]

Zipcache: Accurate and efficient kv cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[14] [14]

Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[15] [15]

RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

Pith/arXiv arXiv 2024

[16] [16]

GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM.arXiv preprint arXiv:2403.05527, 2024

arXiv 2024

[17] [17]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

2023

[18] [18]

CommVQ: Commutative vector quantization for KV cache compression

Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, and Chuang Gan. CommVQ: Commutative vector quantization for KV cache compression. InPro- ceedings of the 42nd International Conference on Machine Learning, volume 267 ofPro- ceedings of Machine Learning Resear...

2025

[19] [19]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[20] [20]

Mini- Cache: KV cache compression in depth dimension for large language models.arXiv preprint arXiv:2405.14366, 2024

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Mini- Cache: KV cache compression in depth dimension for large language models.arXiv preprint arXiv:2405.14366, 2024

arXiv 2024

[21] [21]

PM-KVQ: Progressive mixed-precision kv cache quantization for long-cot llms

Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, and Yu Wang. PM-KVQ: Progressive mixed-precision kv cache quantization for long-cot llms. InInternational Conference on Learning Representations (ICLR), 2026. arXiv:2505.18610; code:https://github.com/thu-nics/PM-KVQ. 12

arXiv 2026

[22] [22]

Mellette, Alex Forencich, Rukshani Athapathu, Alex C

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. CacheGen: KV cache compression and streaming for fast language model serving. InProceedings of the ACM SIGCOMM 2024 Conference, 2024. doi: 10.1145/3651890.3672274

work page doi:10.1145/3651890.3672274 2024

[23] [23]

Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time.arXiv preprint arXiv:2305.17118, 2023

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time.arXiv preprint arXiv:2305.17118, 2023

arXiv 2023

[24] [24]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InForty-first International Conference on Machine Learning (ICML), 2024

2024

[25] [25]

TriAttention: Efficient long reasoning with trigonometric KV compression.arXiv preprint arXiv:2604.04921, 2026

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. TriAttention: Efficient long reasoning with trigonometric KV compression.arXiv preprint arXiv:2604.04921, 2026

Pith/arXiv arXiv 2026

[26] [26]

Yarn: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024

[27] [27]

Efficiently scaling transformer inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022

arXiv 2022

[28] [28]

Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y . Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU.arXiv preprint arXiv:2303.06865, 2023

arXiv 2023

[29] [29]

Cache me if you must: Adaptive key-value quantization for large language models

Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, and Dan Alistarh. Cache me if you must: Adaptive key-value quantization for large language models. InForty-second International Conference on Machine Learning (ICML), 2025

2025

[30] [30]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024

[31] [31]

Accurate kv cache quantization with outlier tokens trac- ing

Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhe- feng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens trac- ing. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, Vienna, Austria, 2025. Associ- ation for Computati...

work page doi:10.18653/v1/2025.acl-long.631 2025

[32] [32]

KVSink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms

Zunhai Su and Kehong Yuan. KVSink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms. InConference on Language Modeling (COLM), 2025

2025

[33] [33]

RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations

Zunhai Su, Hanyu Wei, Zhe Chen, Wang Shen, Linge Li, Huangqi Yu, and Kehong Yuan. RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), pages 6200–6208, 2025. doi: 10.24963/ijcai.2025/690. URL https://www...

work page doi:10.24963/ijcai.2025/690 2025

[34] [34]

MoQAE: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts

Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, and Jianzong Wang. MoQAE: Mixed-precision quantization for long-context llm inference via mixture of quantization-aware experts. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10810–10820, Vienna, Austria,

[35] [35]

doi: 10.18653/v1/2025.acl-long.531

Association for Computational Linguistics. doi: 10.18653/v1/2025.acl-long.531. URL https://aclanthology.org/2025.acl-long.531/. 13

work page doi:10.18653/v1/2025.acl-long.531 2025

[36] [36]

SQuat: Subspace-orthogonal kv cache quantization.arXiv preprint arXiv:2503.24358, 2025

Hao Wang, Ligong Han, Kai Xu, and Akash Srivastava. SQuat: Subspace-orthogonal kv cache quantization.arXiv preprint arXiv:2503.24358, 2025

arXiv 2025

[37] [37]

Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643,

Haojun Xia, Xiaoxia Wu, Jisen Li, Robert Wu, Junxiong Wang, Jue Wang, Chenxi Li, Aman Singhal, Alay Dilipbhai Shah, Alpay Ariyak, Donglin Zhuang, Zhongzhu Zhou, Ben Athi- waratkun, Zhen Zheng, and Shuaiwen Leon Song. Kitty: Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost.arXiv preprint arXiv:2511.18643,

arXiv

[38] [38]

Code:https://github.com/Summer-Summer/Kitty

[39] [39]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024

[40] [40]

RAP: KV-cache compres- sion via RoPE-aligned pruning.arXiv preprint arXiv:2602.02599, 2026

Jihao Xin, Tian Lyu, David Keyes, Hatem Ltaief, and Marco Canini. RAP: KV-cache compres- sion via RoPE-aligned pruning.arXiv preprint arXiv:2602.02599, 2026

arXiv 2026

[41] [41]

No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization.arXiv preprint arXiv:2402.18096, 2024

June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, and Dongsoo Lee. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization.arXiv preprint arXiv:2402.18096, 2024

arXiv 2024

[42] [42]

SubGen: Token generation in sublinear time and memory.arXiv preprint arXiv:2402.06082, 2024

Amir Zandieh, Insu Han, Vahab Mirrokni, and Amin Karbasi. SubGen: Token generation in sublinear time and memory.arXiv preprint arXiv:2402.06082, 2024

arXiv 2024

[43] [43]

Turboquant: Online vec- tor quantization with near-optimal distortion rate

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vec- tor quantization with near-optimal distortion rate. InInternational Conference on Learning Representations (ICLR), 2026

2026

[44] [44]

MixKVQ: Query- aware mixed-precision kv cache quantization for long-context reasoning.arXiv preprint arXiv:2512.19206, 2025

Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, and Cen Chen. MixKVQ: Query- aware mixed-precision kv cache quantization for long-context reasoning.arXiv preprint arXiv:2512.19206, 2025

arXiv 2025

[45] [45]

Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[46] [46]

H2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[47] [47]

Geometry

Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, and Xuanjing Huang. EliteKV: Scalable KV cache compression via RoPE frequency selection and joint low-rank projection.arXiv preprint arXiv:2503.01586, 2025. Appendix Roadmap Appendix A collects proofs for Block-GTQ (error bound, block weight, greedy optimality)....

arXiv 2025