pith. machine review for the scientific record.

arxiv: 2605.09649 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: no theorem link

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Arman Cohan, Hieu Trung Nguyen, Ngoc Bui, Rex Ying

Pith reviewed 2026-05-12 04:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords KV cache eviction · long-context inference · retention scores · attention dilution · memory compression · language models · vision-language models

The pith

A learnable global KV eviction policy can match or exceed full-cache performance on long-context tasks while using far less memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in long contexts, keeping every token in the KV cache can actually hurt performance because irrelevant tokens dilute the attention mechanism. Instead of trying to approximate the full cache, the authors propose learning retention scores for each token to decide what to keep under a fixed memory budget. This global scoring allows tokens from different layers and modalities to compete for cache space based on their predicted future utility. If correct, this means long-context models can run with smaller memory footprints without losing reasoning quality, and sometimes while gaining it, across language, vision-language, and dialogue tasks.
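A minimal way to see the dilution argument (an editor's illustration, not the paper's formal statement): with softmax attention over a set R of relevant cached entries and a set D of distractors, the weight placed on a relevant entry i with logit s_i is

    \alpha_i = \frac{e^{s_i}}{\sum_{j \in R} e^{s_j} + \sum_{k \in D} e^{s_k}}

so each distractor kept in the cache adds a positive term to the denominator, and the total attention mass on the relevant evidence can only shrink as D grows. Evicting distractors concentrates that mass without changing the relevant logits at all.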

Core claim

The central discovery is that a unified retention-based eviction method, using lightweight gates per layer and a shared scoring projection, learns to retain tokens that will be useful later. This not only compresses the cache but improves generation by reducing attention dilution from irrelevant evidence. The approach is justified theoretically as a query-agnostic proxy for future utility and demonstrated empirically to match or surpass full-cache inference on diverse benchmarks.

What carries the argument

Lightweight retention gates that assign utility scores to cached KV entries, combined with a shared final scoring projection for global calibration across layers and heads.
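A minimal sketch of how such a mechanism could be wired, assuming (purely for illustration) that each per-layer gate reads the cached key vector and that calibration is a single shared linear map; RetentionScorer, global_evict, the feature choice, and all dimensions are hypothetical, not the paper's specification:

    import torch
    import torch.nn as nn

    class RetentionScorer(nn.Module):
        """Illustrative per-layer gates plus a shared calibration projection."""
        def __init__(self, num_layers: int, head_dim: int, gate_dim: int = 16):
            super().__init__()
            # One lightweight gate per layer: maps a per-entry feature (here, the key) to a small embedding.
            self.gates = nn.ModuleList([nn.Linear(head_dim, gate_dim) for _ in range(num_layers)])
            # Shared final projection: puts gate outputs from every layer and head on one comparable scale.
            self.shared_proj = nn.Linear(gate_dim, 1)

        def score(self, keys_per_layer):
            """keys_per_layer: list of [num_heads, seq_len, head_dim] tensors, one entry per layer."""
            scores = []
            for layer_idx, keys in enumerate(keys_per_layer):
                g = torch.tanh(self.gates[layer_idx](keys))      # [H, T, gate_dim]
                scores.append(self.shared_proj(g).squeeze(-1))   # [H, T] calibrated utility scores
            return scores

    def global_evict(scores, budget: int):
        """Keep the top-`budget` cache entries across ALL layers and heads: one global policy."""
        flat = torch.cat([s.flatten() for s in scores])
        keep = torch.zeros_like(flat, dtype=torch.bool)
        keep[flat.topk(min(budget, flat.numel())).indices] = True
        masks, offset = [], 0
        for s in scores:                                         # split the global mask back per layer/head
            n = s.numel()
            masks.append(keep[offset:offset + n].view_as(s))
            offset += n
        return masks

    # Illustrative use: masks = global_evict(scorer.score(keys_per_layer), budget=1024)

The point of the shared projection in this sketch is that scores from different layers and heads land on a single scale, so one top-budget cut is enough to implement the cross-layer, cross-modality competition the paper describes.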

Load-bearing premise

That the learned retention scores will correctly identify which tokens will be useful in the future across a wide range of new tasks and data distributions.

What would settle it

Observing whether the method falls below full-cache performance on a new long-context benchmark that was not used in training or tuning.

Original abstract

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language and vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a global retention-based KV eviction method for long-context inference in language and vision-language models. Lightweight retention gates assign utility scores to cached KV entries, calibrated by a shared final scoring projection across layers, heads, and modalities. This enables tokens to compete directly under a unified memory budget. The central claims are that selective, learnable eviction can reduce attention dilution from irrelevant tokens (unlike full-cache attention) and that geometric retention serves as a query-agnostic proxy for future utility, supported by theoretical analysis. Experiments across long-context reasoning, vision-language, and multi-turn dialogue benchmarks show substantial KV memory reduction while matching or surpassing full-cache performance.

Significance. If the empirical results and generalization hold, the work is significant because it reframes KV eviction from a lossy approximation of full-cache inference to an active mechanism for improving long-context reasoning via reduced dilution. The global cross-layer/head/modal competition and the theoretical justification for geometric retention are distinctive contributions. The approach could influence efficient inference designs if the learned components prove robust without task-specific tuning.

major comments (2)
  1. [Method and Experiments sections] The central empirical claim (matching or surpassing full cache) depends on the learned retention gates and shared scoring projection generalizing future token utility across task distributions. The training procedure for these components (detailed in the method section) must be shown to avoid overfitting to narrow sequences or modalities; without explicit zero-shot transfer ablations or OOD benchmarks, the global eviction policy risks mis-ranking tokens on unseen long contexts, either evicting useful evidence or retaining noise.
  2. [Theoretical analysis section] The justification that geometric retention is a query-agnostic proxy for utility and that preferential retention reduces dilution is load-bearing for interpreting the gains as improvement rather than approximation. This needs to be connected explicitly to the learned component; if the theory assumes fixed or oracle scores, it does not fully underwrite the learned policy's behavior under distribution shift.
minor comments (3)
  1. [Method section] Clarify the exact form of the retention gate (e.g., its input features and activation) and the shared projection matrix dimensions to allow reproduction.
  2. [Experiments section] Add statistical significance tests or variance across runs for the benchmark comparisons to strengthen the claim of matching or surpassing full cache; a minimal sketch of one such test appears after this list.
  3. [Experiments section] Ensure all baselines (including recent eviction methods) are described with identical hyper-parameters and cache budgets for fair comparison.
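On minor comment 2, a minimal sketch of one suitable test, assuming per-example scores are available for both the eviction policy and the full-cache baseline on the same benchmark items; the paired bootstrap here is the referee's suggestion made concrete, not an analysis the paper reports:

    import numpy as np

    def paired_bootstrap(scores_a, scores_b, n_resamples: int = 10_000, seed: int = 0):
        """Fraction of bootstrap resamples in which system A's mean score beats system B's."""
        rng = np.random.default_rng(seed)
        a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
        assert a.shape == b.shape, "paired test needs scores on the same benchmark items"
        n, wins = len(a), 0
        for _ in range(n_resamples):
            idx = rng.integers(0, n, size=n)   # resample examples with replacement
            wins += a[idx].mean() > b[idx].mean()
        return wins / n_resamples              # near 1.0: A reliably better; near 0.5: no evidence

    # Illustrative use: paired_bootstrap(evicted_policy_scores, full_cache_scores)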

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, clarifying the generalization properties of our method and strengthening the connection between theory and the learned policy. We outline specific revisions to the manuscript.

Point-by-point responses
  1. Referee: [Method and Experiments sections] The central empirical claim (matching or surpassing full cache) depends on the learned retention gates and shared scoring projection generalizing future token utility across task distributions. The training procedure for these components (detailed in the method section) must be shown to avoid overfitting to narrow sequences or modalities; without explicit zero-shot transfer ablations or OOD benchmarks, the global eviction policy risks mis-ranking tokens on unseen long contexts, either evicting useful evidence or retaining noise.

    Authors: We appreciate this concern regarding generalization of the learned retention gates and shared scoring projection. Our training uses diverse sequences spanning language and vision-language modalities without task-specific fine-tuning, and the global cross-layer/head calibration is explicitly designed to enable tokens to compete under a unified budget, promoting robustness. Experiments across long-context reasoning, vision-language, and multi-turn dialogue benchmarks show consistent matching or surpassing of full-cache performance. To directly address potential overfitting and distribution shift, we will add explicit zero-shot transfer ablations and OOD benchmarks (e.g., evaluating the eviction policy on held-out task distributions) in the revised Experiments section. revision: yes

  2. Referee: [Theoretical analysis section] The justification that geometric retention is a query-agnostic proxy for utility and that preferential retention reduces dilution is load-bearing for interpreting the gains as improvement rather than approximation. This needs to be connected explicitly to the learned component; if the theory assumes fixed or oracle scores, it does not fully underwrite the learned policy's behavior under distribution shift.

    Authors: We agree that an explicit link between the theoretical analysis and the learned retention policy is necessary. The theory establishes that geometric retention acts as a query-agnostic proxy for future utility and that retaining high-utility tokens reduces attention dilution, without assuming oracle scores; it holds for any scoring mechanism that ranks tokens by utility. Our learned gates are trained to approximate this utility via the retention objective, with the shared projection providing cross-layer calibration. We will revise the Theoretical analysis section to add a new subsection explicitly connecting the learned components to the theory, including how the training aligns with the proxy assumption and citing ablation results showing the policy's behavior under the distribution shifts present in our benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the results are empirical benchmarks from the trained retention model

Full rationale

The paper trains lightweight retention gates and a shared scoring projection on data to assign utility scores, then evaluates the resulting eviction policy on diverse long-context benchmarks. Performance gains are measured externally rather than defined to equal the training objective by construction. The theoretical analysis of attention dilution and geometric retention as a query-agnostic proxy is presented as supporting justification but does not reduce the reported benchmark numbers to a tautology. Any self-citations are not load-bearing for the central empirical claims, which remain falsifiable against full-cache baselines.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The method rests on learned parameters for retention gates and a domain assumption that geometric retention proxies future utility; no new physical entities are postulated.

free parameters (2)
  • retention gate weights
    Learned parameters that produce per-token utility scores during training.
  • shared scoring projection weights
    Learned parameters that calibrate scores across layers and heads.
axioms (1)
  • domain assumption: Geometric retention serves as a query-agnostic proxy for future token utility
    Invoked to justify the eviction policy without per-query dependence; a hedged illustration of one reading of this assumption follows the ledger.
invented entities (2)
  • lightweight retention gates (no independent evidence)
    purpose: Assign utility scores to KV cache entries
    New learned component introduced to enable selective eviction.
  • shared final scoring projection (no independent evidence)
    purpose: Calibrate utility scores globally across layers and heads
    New component to enable direct competition for cache slots.
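One plausible reading of the geometric-retention assumption, offered as an editor's gloss rather than the paper's definition: treat retention as a per-token survival process in which cache entry t is kept at each future decoding step with a probability r_t that depends only on the token itself, never on the incoming query,

    \Pr[\text{entry } t \text{ survives } k \text{ more steps}] = r_t^{k}, \qquad \mathbb{E}[\text{remaining lifetime}] = \frac{r_t}{1 - r_t}

Because r_t is fixed before any future query arrives, ranking entries by it gives a query-agnostic stand-in for expected future usefulness, which is what allows a single precomputed score to drive eviction.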

pith-pipeline@v0.9.0 · 5529 in / 1295 out tokens · 61806 ms · 2026-05-12T04:56:30.866814+00:00 · methodology

