Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
Pith reviewed 2026-05-12 04:56 UTC · model grok-4.3
The pith
A learnable global KV eviction policy can match or exceed full-cache performance on long-context tasks while using far less memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a unified retention-based eviction method, using lightweight gates per layer and a shared scoring projection, learns to retain tokens that will be useful later. This not only compresses the cache but also improves generation by reducing attention dilution from irrelevant evidence. The approach is justified theoretically as a query-agnostic proxy for future utility and demonstrated empirically to match or surpass full-cache inference on diverse benchmarks.
What carries the argument
Lightweight retention gates that assign utility scores to cached KV entries, combined with a shared final scoring projection for global calibration across layers and heads.
Load-bearing premise
That the learned retention scores will correctly identify which tokens will be useful in the future across a wide range of new tasks and data distributions.
What would settle it
Observing whether the method falls below full-cache performance on a new long-context benchmark that was not used in training or tuning.
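Operationally this is a one-sided comparison on a held-out benchmark. A minimal sketch, with illustrative accuracies and an arbitrary noise margin (neither taken from the paper):

```python
# Minimal falsification check: does eviction fall below full-cache
# performance, beyond noise, on a benchmark not used for training or
# tuning? Scores are per-task accuracies; the margin is an
# illustrative noise allowance, not a value from the paper.
def falls_below_full_cache(full, evicted, margin=0.5):
    gaps = [f - e for f, e in zip(full, evicted)]
    mean_gap = sum(gaps) / len(gaps)
    return mean_gap > margin  # True would refute the central claim

full_cache = [71.2, 64.8, 58.3]     # hypothetical task accuracies
with_eviction = [71.5, 64.1, 59.0]  # hypothetical task accuracies
print(falls_below_full_cache(full_cache, with_eviction))
```

A single benchmark where the mean gap clears the noise margin would settle the question in the negative; matching within the margin everywhere leaves the claim standing.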
original abstract
The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
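The mechanism the abstract describes can be sketched in a few lines. The gate's input features, the shared projection's form, and all dimensions below are our assumptions for illustration; the paper's exact parameterization is not reproduced on this page:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cache: L layers, H heads, T cached tokens, head dimension d.
L, H, T, d = 2, 4, 128, 16
keys = rng.normal(size=(L, H, T, d))

# Hypothetical lightweight retention gate: one linear map per layer
# producing a raw utility score per cached entry (the paper's gate
# inputs and activation are not specified here).
gate_w = rng.normal(size=(L, d)) / np.sqrt(d)
raw = np.einsum("lhtd,ld->lht", keys, gate_w)

# Hypothetical shared final scoring projection: a single affine
# calibration applied to every layer and head so that scores from
# different layers/heads are directly comparable.
shared_scale, shared_bias = 1.5, -0.1
scores = shared_scale * raw + shared_bias

# Global eviction: all entries compete for one unified budget B,
# regardless of which layer or head they belong to.
B = 256
flat = scores.reshape(-1)
keep = np.zeros_like(flat, dtype=bool)
keep[np.argsort(flat)[-B:]] = True
keep = keep.reshape(L, H, T)

assert keep.sum() == B        # exactly B entries retained in total
print(keep.sum(axis=-1))      # retained entries per (layer, head)
```

The point of the sketch is the last step: because a single threshold is applied to a globally calibrated score, per-layer and per-head budgets emerge from competition rather than being fixed in advance.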
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a global retention-based KV eviction method for long-context inference in language and vision-language models. Lightweight retention gates assign utility scores to cached KV entries, calibrated by a shared final scoring projection across layers, heads, and modalities. This enables tokens to compete directly under a unified memory budget. The central claims are that selective, learnable eviction can reduce attention dilution from irrelevant tokens (unlike full-cache attention) and that geometric retention serves as a query-agnostic proxy for future utility, supported by theoretical analysis. Experiments across long-context reasoning, vision-language, and multi-turn dialogue benchmarks show substantial KV memory reduction while matching or surpassing full-cache performance.
Significance. If the empirical results and generalization hold, the work is significant because it reframes KV eviction from a lossy approximation of full-cache inference to an active mechanism for improving long-context reasoning via reduced dilution. The global cross-layer/head/modal competition and the theoretical justification for geometric retention are distinctive contributions. The approach could influence efficient inference designs if the learned components prove robust without task-specific tuning.
major comments (2)
- [Method and Experiments sections] The central empirical claim (matching or surpassing full cache) depends on the learned retention gates and shared scoring projection generalizing future token utility across task distributions. The training procedure for these components (detailed in the method section) must be shown to avoid overfitting to narrow sequences or modalities; without explicit zero-shot transfer ablations or OOD benchmarks, the global eviction policy risks mis-ranking tokens on unseen long contexts, either evicting useful evidence or retaining noise.
- [Theoretical analysis section] The justification that geometric retention is a query-agnostic proxy for utility and that preferential retention reduces dilution is load-bearing for interpreting the gains as improvement rather than approximation. This needs to be connected explicitly to the learned component; if the theory assumes fixed or oracle scores, it does not fully underwrite the learned policy's behavior under distribution shift.
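The dilution mechanism this comment calls load-bearing admits a simple closed form. The following reconstruction uses our notation (relevant set $R$, irrelevant set $I$), not the paper's:

```latex
Let $s_i$ be the pre-softmax attention score of cached token $i$ for some
query, and split the cache into a relevant set $R$ and an irrelevant set
$I$. Under full-cache attention the probability mass on relevant evidence is
\[
\alpha_R \;=\; \frac{\sum_{i \in R} e^{s_i}}
                    {\sum_{i \in R} e^{s_i} + \sum_{j \in I} e^{s_j}}.
\]
Evicting any subset $E \subseteq I$ removes only positive terms from the
denominator, so
\[
\alpha_R^{\mathrm{evict}} \;=\; \frac{\sum_{i \in R} e^{s_i}}
    {\sum_{i \in R} e^{s_i} + \sum_{j \in I \setminus E} e^{s_j}}
\;\ge\; \alpha_R,
\]
with strict inequality whenever $E \neq \emptyset$.
```

The gain is real only if the learned scores actually rank $I$ below $R$; mis-ranking evicts terms from the numerator instead, which is exactly the distribution-shift worry raised above.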
minor comments (3)
- [Method section] Clarify the exact form of the retention gate (e.g., its input features and activation) and the shared projection matrix dimensions to allow reproduction.
- [Experiments section] Add statistical significance tests or variance across runs for the benchmark comparisons to strengthen the claim of matching or surpassing full cache.
- [Experiments section] Ensure all baselines (including recent eviction methods) are described with identical hyper-parameters and cache budgets for fair comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment below, clarifying the generalization properties of our method and strengthening the connection between theory and the learned policy. We outline specific revisions to the manuscript.
point-by-point responses
- Referee: [Method and Experiments sections] The central empirical claim (matching or surpassing full cache) depends on the learned retention gates and shared scoring projection generalizing future token utility across task distributions. The training procedure for these components (detailed in the method section) must be shown to avoid overfitting to narrow sequences or modalities; without explicit zero-shot transfer ablations or OOD benchmarks, the global eviction policy risks mis-ranking tokens on unseen long contexts, either evicting useful evidence or retaining noise.
Authors: We appreciate this concern regarding generalization of the learned retention gates and shared scoring projection. Our training uses diverse sequences spanning language and vision-language modalities without task-specific fine-tuning, and the global cross-layer/head calibration is explicitly designed to enable tokens to compete under a unified budget, promoting robustness. Experiments across long-context reasoning, vision-language, and multi-turn dialogue benchmarks show consistent matching or surpassing of full-cache performance. To directly address potential overfitting and distribution shift, we will add explicit zero-shot transfer ablations and OOD benchmarks (e.g., evaluating the eviction policy on held-out task distributions) in the revised Experiments section. revision: yes
- Referee: [Theoretical analysis section] The justification that geometric retention is a query-agnostic proxy for utility and that preferential retention reduces dilution is load-bearing for interpreting the gains as improvement rather than approximation. This needs to be connected explicitly to the learned component; if the theory assumes fixed or oracle scores, it does not fully underwrite the learned policy's behavior under distribution shift.
Authors: We agree that an explicit link between the theoretical analysis and the learned retention policy is necessary. The theory establishes that geometric retention acts as a query-agnostic proxy for future utility and that retaining high-utility tokens reduces attention dilution, without assuming oracle scores; it holds for any scoring mechanism that ranks tokens by utility. Our learned gates are trained to approximate this utility via the retention objective, with the shared projection providing cross-layer calibration. We will revise the Theoretical analysis section to add a new subsection explicitly connecting the learned components to the theory, including how the training aligns with the proxy assumption and citing ablation results showing the policy's behavior under the distribution shifts present in our benchmarks. revision: yes
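One concrete shape such an ablation could take (our sketch, not the authors' protocol): on held-out sequences, compare the tokens the learned scores would retain against the tokens that actually receive attention mass later. The numbers below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
T, B = 512, 64

# Realized "future utility": total attention mass each cached token
# receives from later queries (synthetic here, by assumption
# positively correlated with the learned scores).
utility = rng.exponential(size=T)
# Hypothetical learned retention scores: a noisy view of utility.
scores = utility + 0.5 * rng.normal(size=T)

# The policy retains the top-B tokens by score; an oracle retains the
# top-B by realized utility. Their overlap measures mis-ranking risk.
kept = set(np.argsort(scores)[-B:])
oracle = set(np.argsort(utility)[-B:])
recall_at_B = len(kept & oracle) / B

# Utility actually preserved, relative to the oracle's preserved
# utility (the oracle is optimal among all B-token subsets).
preserved = utility[list(kept)].sum() / utility[list(oracle)].sum()
print(recall_at_B, preserved)
```

Tracking these two quantities on out-of-distribution sequences would make the proxy assumption directly testable, rather than inferred from end-task scores alone.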
Circularity Check
No significant circularity; the results are empirical benchmarks from a trained retention model.
full rationale
The paper trains lightweight retention gates and a shared scoring projection on data to assign utility scores, then evaluates the resulting eviction policy on diverse long-context benchmarks. Performance gains are measured externally rather than defined to equal the training objective by construction. The theoretical analysis of attention dilution and geometric retention as a query-agnostic proxy is presented as supporting justification but does not reduce the reported benchmark numbers to a tautology. Any self-citations are not load-bearing for the central empirical claims, which remain falsifiable against full-cache baselines.
Axiom & Free-Parameter Ledger
free parameters (2)
- retention gate weights
- shared scoring projection weights
axioms (1)
- domain assumption: Geometric retention serves as a query-agnostic proxy for future token utility
invented entities (2)
- lightweight retention gates (no independent evidence)
- shared final scoring projection (no independent evidence)