Recognition: no theorem link
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
Pith reviewed 2026-05-12 01:42 UTC · model grok-4.3
The pith
Value-aware KV eviction improves cache compression only when it recovers decode-side evidence first, then ranks output value, and preserves coupled evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A selector can fail by missing needed evidence, scoring tokens that do not change the output, or breaking related evidence when compressing the cache. The fixed-contract probe, which combines attention mass with the estimated output change from block removal, is positive on 72.6 percent of positive-margin cells and 32.4 percent of nonpositive-margin cells on LongBench. NeedleBench M-RT at 32k and RULER at 8k confirm that probe support stays closed under branched retrieval. A 264-cell sign evaluation separates support recovery and output-value ranking from boundary leverage effects. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
What carries the argument
The fixed-contract diagnostic, which holds the selector setup fixed and changes one decision slot at a time, together with a value probe that merges block attention mass and estimated output change from removal.
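The review does not reproduce the probe's exact formula, so the following is a minimal sketch of the kind of score the description implies: a block's decode-side attention mass blended with the logit change observed when that block alone is removed from an otherwise fixed cache. The blending weight alpha, the max-absolute-difference norm, and the array interfaces are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def probe_score(attn_weights, block_positions, logits_full, logits_ablated, alpha=0.5):
    """Sketch of a value probe for one candidate block under a fixed contract.

    attn_weights:    attention of the current query over cached positions (1D, sums to 1)
    block_positions: indices of the cache positions belonging to the candidate block
    logits_full:     next-token logits computed with the full cache
    logits_ablated:  next-token logits with only this block removed, rest of cache unchanged
    """
    mass = float(attn_weights[block_positions].sum())          # decode-side attention mass
    delta = float(np.abs(logits_full - logits_ablated).max())  # estimated output change from removal
    return alpha * mass + (1.0 - alpha) * delta                # blended value score
```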
If this is right
- The probe is positive on 72.6 percent of positive-margin cells but only 32.4 percent of nonpositive-margin cells across three models and two budgets (a minimal sign-agreement sketch follows this list).
- The probe maintains support closure under branched retrieval on NeedleBench M-RT at 32k and RULER at 8k.
- A 264-cell sign evaluation isolates support recovery, output-value ranking, and leverage effects near cache boundaries.
- Selectors must follow the sequence of evidence recovery, then value ranking, then coupled-evidence preservation.
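As a rough illustration of how the per-cell sign evaluation could be tallied, the sketch below computes the two reported fractions from per-cell probe scores and accuracy margins. The cell definition and margin metric are assumptions; the paper only states that cells span three models and two budgets.

```python
import numpy as np

def sign_separation(probe_scores, margins):
    """Fraction of positive-margin cells with a positive probe, and the same
    fraction among nonpositive-margin cells (compare 72.6% vs 32.4% on LongBench)."""
    probe_scores, margins = np.asarray(probe_scores), np.asarray(margins)
    helpful = margins > 0
    on_helpful = float((probe_scores[helpful] > 0).mean())      # probe positive where the selector helped
    on_nonhelpful = float((probe_scores[~helpful] > 0).mean())  # probe positive where it did not
    return on_helpful, on_nonhelpful
```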
Where Pith is reading between the lines
- The same isolation approach could be used to diagnose failures in other KV compression methods such as quantization or merging.
- If the output-change estimate remains reliable at scale, it could support dynamic cache policies that adapt eviction mid-generation.
- Hybrid selectors might be built by composing separate modules for each step in the identified order rather than learning a single scoring function (see the sketch after this list).
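A hybrid selector composed along those lines might look like the sketch below: three injected modules applied in the diagnostic's order, with coupled blocks kept or dropped together. The module interfaces (recover, rank_value, coupled_with) are hypothetical; nothing in the paper prescribes this decomposition.

```python
def select_blocks(blocks, budget, recover, rank_value, coupled_with):
    """Compose a selector in the order the diagnostic suggests.

    recover(blocks)           -> blocks plausibly needed by future decoding
    rank_value(blocks)        -> those blocks sorted by estimated output value, best first
    coupled_with(block, kept) -> blocks that must stay together with `block`
    """
    kept = []
    for block in rank_value(recover(blocks)):        # step 1: recover evidence, step 2: rank value
        if block in kept:
            continue
        group = [block] + [b for b in coupled_with(block, kept) if b not in kept and b != block]
        if len(kept) + len(group) <= budget:          # step 3: preserve coupled evidence within budget
            kept.extend(group)
    return kept
```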
Load-bearing premise
The estimated output change from removing a block accurately captures its true value to future decoding without confounding the fixed-contract isolation.
What would settle it
A case on LongBench where the probe's estimated output change from block removal does not match the actual output difference observed when that block is evicted during real decoding.
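A minimal harness for that check, assuming the evaluation can supply the probe's estimated change for a block plus two decode runs (one with the full cache, one with the block actually evicted during generation); the threshold and the injected callables are illustrative, not part of the paper's protocol.

```python
def probe_disagrees_with_eviction(probe_delta, decode_full, decode_evicted, tol=1e-3):
    """True when the probe's estimate and real eviction disagree: the probe predicts
    the block matters (large estimated output change) but evicting it leaves the
    generated text unchanged, or vice versa. Such cells are the settling cases."""
    output_changed = decode_full() != decode_evicted()   # actual effect observed during decoding
    probe_predicts_change = probe_delta > tol            # illustrative decision threshold
    return output_changed != probe_predicts_change
```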
Original abstract
Long-context LLM inference is bottlenecked by the memory and bandwidth cost of reading large KV caches during decoding. KV compression reduces this cost by keeping only part of the cache, but task accuracy alone does not identify why a selector succeeds or fails. A selector can fail at three steps: it may miss the evidence future decoding needs, give high scores to tokens that do not affect the output, or break related evidence when fitting scores into a small cache. We introduce a fixed-contract diagnostic that holds the selector's setup fixed and changes one decision slot at a time. For value ranking, the probe combines a block's attention mass with the estimated output change from removing it. On LongBench across three models and two budgets, the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. NeedleBench M-RT at 32k and a RULER 8k check probe support closure under branched retrieval, and a 264-cell sign evaluation separates support recovery and output-value ranking from leverage effects near the boundary. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a fixed-contract diagnostic for evaluating KV cache compression selectors in long-context LLMs. It identifies three potential failure points in selectors: missing necessary evidence, assigning high scores to low-impact tokens, and disrupting coupled evidence when compressing. The diagnostic holds the selector's contract fixed while varying one decision slot at a time. For value ranking, the probe integrates attention mass with the estimated change in output from removing a block. Empirical evaluation on LongBench across three models and two budgets shows the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. Additional checks on NeedleBench and RULER support the findings, leading to the ordering: recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
Significance. If the diagnostic's isolation procedure holds, this work offers a valuable tool for dissecting why certain KV eviction methods succeed or fail, moving beyond aggregate accuracy metrics. It could inform the design of more robust cache compression strategies for efficient LLM inference, particularly in identifying when value-aware approaches provide benefits. The fixed-contract approach and multi-benchmark validation are strengths.
major comments (2)
- [Value-ranking probe and LongBench evaluation] The value-ranking probe (abstract and experimental results) combines attention mass with estimated output change from block removal under fixed-contract isolation. The removal simulation may introduce confounding interactions such as altered attention patterns or new token dependencies not present in the original cache, potentially misclassifying the true marginal value. This directly affects the reliability of the reported 72.6% vs 32.4% separation on LongBench and the derived ordering.
- [LongBench results] LongBench results (across three models and two budgets): the percentages 72.6% and 32.4% are presented without error bars, confidence intervals, details on data exclusion rules, or full experimental protocol. This makes it difficult to assess statistical significance and robustness of the central empirical claim.
minor comments (2)
- [Abstract] The abstract references 'a 264-cell sign evaluation' without elaboration; adding a brief description or pointer to the relevant section would improve clarity.
- [Experimental sections] Consider including exact model names, context lengths, and budget sizes (e.g., in a table) when summarizing the LongBench, NeedleBench, and RULER results for improved reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the fixed-contract diagnostic and its empirical validation. We address each major comment below, clarifying the design choices in the value-ranking probe and committing to improved statistical reporting for the LongBench results.
Point-by-point responses
Referee: [Value-ranking probe and LongBench evaluation] The value-ranking probe (abstract and experimental results) combines attention mass with estimated output change from block removal under fixed-contract isolation. The removal simulation may introduce confounding interactions such as altered attention patterns or new token dependencies not present in the original cache, potentially misclassifying the true marginal value. This directly affects the reliability of the reported 72.6% vs 32.4% separation on LongBench and the derived ordering.
Authors: The fixed-contract diagnostic deliberately isolates one decision slot while holding the selector's overall contract (cache size, eviction policy for remaining tokens) fixed, which is intended to reduce the scope of secondary interactions compared to full re-encoding. The probe further combines attention mass with a direct estimate of output logit change under this isolation to approximate marginal value. We agree that residual confounds from attention redistribution cannot be entirely eliminated in simulation. In revision we will add an explicit limitations paragraph discussing this point, together with the supporting evidence from the NeedleBench M-RT and RULER checks that the ordering remains consistent under branched retrieval. We do not claim the probe is an oracle, only that it yields a useful diagnostic separation (72.6% of positive-margin vs 32.4% of nonpositive-margin cells) that is corroborated across benchmarks. revision: partial
Referee: [LongBench results] LongBench results (across three models and two budgets): the percentages 72.6% and 32.4% are presented without error bars, confidence intervals, details on data exclusion rules, or full experimental protocol. This makes it difficult to assess statistical significance and robustness of the central empirical claim.
Authors: We accept this criticism. The revised manuscript will report bootstrap confidence intervals for both percentages, state the exact cell-inclusion criteria (minimum 50 cells per model-budget pair), and move the complete experimental protocol—including model versions, random seeds, and token-removal simulation details—into a new appendix section. These additions will allow direct assessment of the robustness of the reported separation. revision: yes
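For reference, a percentile bootstrap of the kind committed to above could be as simple as the sketch below, resampling the boolean per-cell outcomes (probe positive or not) within a margin group; the resample count, seed, and interface are illustrative choices, not the paper's protocol.

```python
import numpy as np

def bootstrap_proportion_ci(flags, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion such as 'probe positive on
    positive-margin cells'. flags: one boolean per cell."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(flags, dtype=float)
    means = np.array([rng.choice(flags, size=flags.size, replace=True).mean()
                      for _ in range(n_boot)])                   # resampled proportions
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(flags.mean()), (float(lower), float(upper))
```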
Circularity Check
No significant circularity; claims are direct empirical measurements
Full rationale
The paper presents a fixed-contract diagnostic consisting of controlled experiments that hold the selector setup fixed while altering one decision slot at a time. The reported probe results (positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells on LongBench, plus NeedleBench and RULER checks) are direct observations from these held-fixed runs across models and budgets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or description. The central ordering (recover evidence, rank value, preserve coupled evidence) follows from the sign evaluations rather than reducing to inputs by construction. This is self-contained empirical work with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- cache budget sizes
- model selection
axioms (1)
- domain assumption: Holding the selector setup fixed while varying one decision slot isolates the contribution of that slot.
Reference graph
Works this paper leans on
- [1] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
- [2] Z. Cai, Y. Zhang, B. Gao, Y. Liu, T. Liu, K. Lu, W. Xiong, Y. Dong, B. Chang, J. Hu, and W. Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In Proceedings of the Second Conference on Language Modeling (COLM), 2025.
- [3] Z. Cai, W. Xiao, H. Sun, C. Luo, Y. Zhang, K. Wan, Y. Li, Y. Zhou, L.-W. Chang, J. Gu, Z. Dong, A. Anandkumar, A. Asi, and J. Hu. R-KV: Redundancy-aware KV cache compression for reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=2jwAjomEDB.
- [4]
- [5]
- [6]
- [7]
- [8] Y. An, C. Lu, K. Zhu, T. Yu, C. Zhao, H. Wu, M. Tang, and J. Wang. ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing. In The Fourteenth International Conference on Learning Representations, 2026.
- [9]
- [10] A. Bocharnikov, I. Ermakov, D. Kuznedelev, V. Zhdanovskiy, and Y. Yershov. KV Cache Offloading for Context-Intensive Tasks. arXiv preprint arXiv:2604.08426, 2026.
- [11]
- [12] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.
- [13] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/resolve/main/DeepSeek_V4.pdf.
- [14] A. Devoto, M. Jeblick, and S. Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025.
- [15]
- [16]
- [17] Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=tcisuhGsQZ.
- [18]
- [19] S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. In International Conference on Learning Representations, 2024.
- [20]
- [21] X. Li, X. Jin, and L. Zhang. GraphKV: Breaking the static selection paradigm with graph-based KV cache eviction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21899–21909, Suzhou, China, 2025. Association for Computational Linguistics.
- [22] Coleman Richard Charles Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, S. Shao, K. Keutzer, and A. Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=0LXotew9Du.
- [23] K. Huang, H. Meng, J. Wu, J. Lu, C. Ma, Z. Chen, X. Wang, B. Ding, J. Wu, X. Wang, X. He, G. Wang, and J. Zhou. Beyond Magnitude: Leveraging Direction of RLVR Updates for LLM Reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=r6Pw3RiMYL.
- [24] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024. COLM 2024. URL https://arxiv.org/abs/2404.06654.
- [25] Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, and Kai Chen. NeedleBench: Evaluating LLM retrieval and reasoning across varying information densities. Transactions on Machine Learning Research, 2025. URL https://mlanthology.org/tmlr/2025/li2025tmlr-needlebench/.
- [26] Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. SnapKV: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems, volume 37, pages 22947–22970. Curran Associates, Inc., 2024.
- [27] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 32332–32344. PMLR, 2024.
- [28]
- [29]
- [30]
- [31] URL https://openreview.net/forum?id=tDRYrAkOB7.
- [32] M. Oren, M. Hassid, N. Yarden, Y. Adi, and R. Schwartz. Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741. Association for Computational Linguistics, 2024.
- [33] H. Tang, Y. Lin, J. Lin, Q. Han, D. Ke, S. Hong, Y. Yao, and G. Wang. RazorAttention: Efficient KV cache compression through retrieval heads. In The Thirteenth International Conference on Learning Representations, 2025.
- [34] R. Taniguchi, Y. Dong, M. Onizuka, and C. Xiao. Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference. arXiv preprint arXiv:2601.07667, 2026.
- [35] C.-C. Chang, W.-C. Lin, C.-Y. Lin, C.-Y. Chen, Y.-F. Hu, P.-S. Wang, N.-C. Huang, L. Ceze, M. S. Abdelfattah, and K.-C. Wu. Palu: KV-cache compression with low-rank projection. In The Thirteenth International Conference on Learning Representations, 2025.
- [36]
- [37] J.-H. Kim, J. Kim, S. Kwon, J. W. Lee, S. Yun, and H. O. Song. KVzip: Query-agnostic KV cache compression with context reconstruction. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [38] Z. Qin, Y. Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li. CAKE: Cascading and adaptive KV cache eviction with layer preferences. In The Thirteenth International Conference on Learning Representations, 2025.
- [39] J. Lei and S. Ilager. ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs. arXiv preprint arXiv:2603.08727, 2026.
- [40] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [41]
- [42] Z. Su, H. Zhang, W. Wu, Y. Zhang, Y. Liu, H. Xiao, Q. Yang, Y. Sun, R. Yang, C. Zhang, K. Fan, W. Ye, J. Xiong, H. Shen, C. Tao, T. Wu, Z. Wan, Y. Qian, Y. Xie, and N. Wong. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026.
- [43] J. Singh and D. Hakkani-Tür. Do LLMs Encode Functional Importance of Reasoning Tokens? arXiv preprint arXiv:2601.03066, 2026.
- [44] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [45] A. Q. Jiang et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [46]
- [47] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.
- [48] Z. Guo, H. Kamigaito, and T. Watanabe. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21158–21166, 2024.
- [49]
- [50] Y. Feng, H. Guo, J. Lv, S. Kevin Zhou, and X. Xie. DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference. In The Fourteenth International Conference on Learning Representations, 2026.
- [51] URL https://openreview.net/forum?id=nJgS06sX3O.
- [52] J. Ahn, I. Seong, A. Kedia, J. Kim, H. Jang, K. Lee, and Y. Jeon. LookaheadKV: Fast and accurate KV cache eviction by glimpsing into the future without generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RVLMGPXt2i.
- [53] Y. Wang, S. Ji, Y. Liu, Y. Xu, Y. Xu, Q. Zhu, and W. Che. Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34146–34162, 2025.
- [54] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Re, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=RkRrPp7GKO.