Recognition: no theorem link
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
Pith reviewed 2026-05-12 01:42 UTC · model grok-4.3
The pith
Value-aware KV eviction improves cache compression only when it recovers decode-side evidence first, then ranks output value, and preserves coupled evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A selector can fail by missing needed evidence, scoring tokens that do not change the output, or breaking related evidence when compressing the cache. The fixed-contract probe, which combines attention mass with the estimated output change from block removal, is positive on 72.6 percent of positive-margin cells and 32.4 percent of nonpositive-margin cells on LongBench. NeedleBench M-RT at 32k and RULER at 8k confirm that probe support stays closed under branched retrieval. A 264-cell sign evaluation separates support recovery and output-value ranking from boundary leverage effects. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
What carries the argument
The fixed-contract diagnostic, which holds the selector setup fixed and changes one decision slot at a time, together with a value probe that merges block attention mass and estimated output change from removal.
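The review does not reproduce the probe's exact formula, so the following is a minimal sketch of the kind of score the description implies: a block's decode-side attention mass blended with the logit change observed when that block alone is removed from an otherwise fixed cache. The blending weight alpha, the max-absolute-difference norm, and the array interfaces are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def probe_score(attn_weights, block_positions, logits_full, logits_ablated, alpha=0.5):
    """Sketch of a value probe for one candidate block under a fixed contract.

    attn_weights:    attention of the current query over cached positions (1D, sums to 1)
    block_positions: indices of the cache positions belonging to the candidate block
    logits_full:     next-token logits computed with the full cache
    logits_ablated:  next-token logits with only this block removed, rest of cache unchanged
    """
    mass = float(attn_weights[block_positions].sum())          # decode-side attention mass
    delta = float(np.abs(logits_full - logits_ablated).max())  # estimated output change from removal
    return alpha * mass + (1.0 - alpha) * delta                # blended value score
```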
If this is right
- The probe is positive on 72.6 percent of positive-margin cells but only 32.4 percent of nonpositive-margin cells across three models and two budgets (a minimal sign-agreement sketch follows this list).
- The probe maintains support closure under branched retrieval on NeedleBench M-RT at 32k and RULER at 8k.
- A 264-cell sign evaluation isolates support recovery, output-value ranking, and leverage effects near cache boundaries.
- Selectors must follow the sequence of evidence recovery, then value ranking, then coupled-evidence preservation.
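As a rough illustration of how the per-cell sign evaluation could be tallied, the sketch below computes the two reported fractions from per-cell probe scores and accuracy margins. The cell definition and margin metric are assumptions; the paper only states that cells span three models and two budgets.

```python
import numpy as np

def sign_separation(probe_scores, margins):
    """Fraction of positive-margin cells with a positive probe, and the same
    fraction among nonpositive-margin cells (compare 72.6% vs 32.4% on LongBench)."""
    probe_scores, margins = np.asarray(probe_scores), np.asarray(margins)
    helpful = margins > 0
    on_helpful = float((probe_scores[helpful] > 0).mean())      # probe positive where the selector helped
    on_nonhelpful = float((probe_scores[~helpful] > 0).mean())  # probe positive where it did not
    return on_helpful, on_nonhelpful
```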
Where Pith is reading between the lines
- The same isolation approach could be used to diagnose failures in other KV compression methods such as quantization or merging.
- If the output-change estimate remains reliable at scale, it could support dynamic cache policies that adapt eviction mid-generation.
- Hybrid selectors might be built by composing separate modules for each step in the identified order rather than learning a single scoring function (see the sketch after this list).
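A hybrid selector composed along those lines might look like the sketch below: three injected modules applied in the diagnostic's order, with coupled blocks kept or dropped together. The module interfaces (recover, rank_value, coupled_with) are hypothetical; nothing in the paper prescribes this decomposition.

```python
def select_blocks(blocks, budget, recover, rank_value, coupled_with):
    """Compose a selector in the order the diagnostic suggests.

    recover(blocks)           -> blocks plausibly needed by future decoding
    rank_value(blocks)        -> those blocks sorted by estimated output value, best first
    coupled_with(block, kept) -> blocks that must stay together with `block`
    """
    kept = []
    for block in rank_value(recover(blocks)):        # step 1: recover evidence, step 2: rank value
        if block in kept:
            continue
        group = [block] + [b for b in coupled_with(block, kept) if b not in kept and b != block]
        if len(kept) + len(group) <= budget:          # step 3: preserve coupled evidence within budget
            kept.extend(group)
    return kept
```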
Load-bearing premise
The estimated output change from removing a block accurately captures its true value to future decoding without confounding the fixed-contract isolation.
What would settle it
A case on LongBench where the probe's estimated output change from block removal does not match the actual output difference observed when that block is evicted during real decoding.
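A minimal harness for that check, assuming the evaluation can supply the probe's estimated change for a block plus two decode runs (one with the full cache, one with the block actually evicted during generation); the threshold and the injected callables are illustrative, not part of the paper's protocol.

```python
def probe_disagrees_with_eviction(probe_delta, decode_full, decode_evicted, tol=1e-3):
    """True when the probe's estimate and real eviction disagree: the probe predicts
    the block matters (large estimated output change) but evicting it leaves the
    generated text unchanged, or vice versa. Such cells are the settling cases."""
    output_changed = decode_full() != decode_evicted()   # actual effect observed during decoding
    probe_predicts_change = probe_delta > tol            # illustrative decision threshold
    return output_changed != probe_predicts_change
```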
Original abstract
Long-context LLM inference is bottlenecked by the memory and bandwidth cost of reading large KV caches during decoding. KV compression reduces this cost by keeping only part of the cache, but task accuracy alone does not identify why a selector succeeds or fails. A selector can fail at three steps: it may miss the evidence future decoding needs, give high scores to tokens that do not affect the output, or break related evidence when fitting scores into a small cache. We introduce a fixed-contract diagnostic that holds the selector's setup fixed and changes one decision slot at a time. For value ranking, the probe combines a block's attention mass with the estimated output change from removing it. On LongBench across three models and two budgets, the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. NeedleBench M-RT at 32k and a RULER 8k check probe support closure under branched retrieval, and a 264-cell sign evaluation separates support recovery and output-value ranking from leverage effects near the boundary. The resulting order is to recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a fixed-contract diagnostic for evaluating KV cache compression selectors in long-context LLMs. It identifies three potential failure points in selectors: missing necessary evidence, assigning high scores to low-impact tokens, and disrupting coupled evidence when compressing. The diagnostic holds the selector's contract fixed while varying one decision slot at a time. For value ranking, the probe integrates attention mass with the estimated change in output from removing a block. Empirical evaluation on LongBench across three models and two budgets shows the probe is positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells. Additional checks on NeedleBench and RULER support the findings, leading to the ordering: recover decode-side evidence, rank its output value, and preserve coupled evidence during projection.
Significance. If the diagnostic's isolation procedure holds, this work offers a valuable tool for dissecting why certain KV eviction methods succeed or fail, moving beyond aggregate accuracy metrics. It could inform the design of more robust cache compression strategies for efficient LLM inference, particularly in identifying when value-aware approaches provide benefits. The fixed-contract approach and multi-benchmark validation are strengths.
major comments (2)
- [Value-ranking probe and LongBench evaluation] The value-ranking probe (abstract and experimental results) combines attention mass with estimated output change from block removal under fixed-contract isolation. The removal simulation may introduce confounding interactions such as altered attention patterns or new token dependencies not present in the original cache, potentially misclassifying the true marginal value. This directly affects the reliability of the reported 72.6% vs 32.4% separation on LongBench and the derived ordering.
- [LongBench results] LongBench results (across three models and two budgets): the percentages 72.6% and 32.4% are presented without error bars, confidence intervals, details on data exclusion rules, or full experimental protocol. This makes it difficult to assess statistical significance and robustness of the central empirical claim.
minor comments (2)
- [Abstract] The abstract references 'a 264-cell sign evaluation' without elaboration; adding a brief description or pointer to the relevant section would improve clarity.
- [Experimental sections] Consider including exact model names, context lengths, and budget sizes (e.g., in a table) when summarizing the LongBench, NeedleBench, and RULER results for improved reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the fixed-contract diagnostic and its empirical validation. We address each major comment below, clarifying the design choices in the value-ranking probe and committing to improved statistical reporting for the LongBench results.
Point-by-point responses
Referee: [Value-ranking probe and LongBench evaluation] The value-ranking probe (abstract and experimental results) combines attention mass with estimated output change from block removal under fixed-contract isolation. The removal simulation may introduce confounding interactions such as altered attention patterns or new token dependencies not present in the original cache, potentially misclassifying the true marginal value. This directly affects the reliability of the reported 72.6% vs 32.4% separation on LongBench and the derived ordering.
Authors: The fixed-contract diagnostic deliberately isolates one decision slot while holding the selector's overall contract (cache size, eviction policy for remaining tokens) fixed, which is intended to reduce the scope of secondary interactions compared to full re-encoding. The probe further combines attention mass with a direct estimate of output logit change under this isolation to approximate marginal value. We agree that residual confounds from attention redistribution cannot be entirely eliminated in simulation. In revision we will add an explicit limitations paragraph discussing this point, together with the supporting evidence from the NeedleBench M-RT and RULER checks that the ordering remains consistent under branched retrieval. We do not claim the probe is an oracle, only that it yields a useful diagnostic separation (72.6% of positive-margin vs 32.4% of nonpositive-margin cells) that is corroborated across benchmarks. revision: partial
Referee: [LongBench results] LongBench results (across three models and two budgets): the percentages 72.6% and 32.4% are presented without error bars, confidence intervals, details on data exclusion rules, or full experimental protocol. This makes it difficult to assess statistical significance and robustness of the central empirical claim.
Authors: We accept this criticism. The revised manuscript will report bootstrap confidence intervals for both percentages, state the exact cell-inclusion criteria (minimum 50 cells per model-budget pair), and move the complete experimental protocol—including model versions, random seeds, and token-removal simulation details—into a new appendix section. These additions will allow direct assessment of the robustness of the reported separation. revision: yes
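For reference, a percentile bootstrap of the kind committed to above could be as simple as the sketch below, resampling the boolean per-cell outcomes (probe positive or not) within a margin group; the resample count, seed, and interface are illustrative choices, not the paper's protocol.

```python
import numpy as np

def bootstrap_proportion_ci(flags, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion such as 'probe positive on
    positive-margin cells'. flags: one boolean per cell."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(flags, dtype=float)
    means = np.array([rng.choice(flags, size=flags.size, replace=True).mean()
                      for _ in range(n_boot)])                   # resampled proportions
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(flags.mean()), (float(lower), float(upper))
```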
Circularity Check
No significant circularity; claims are direct empirical measurements
Full rationale
The paper presents a fixed-contract diagnostic consisting of controlled experiments that hold the selector setup fixed while altering one decision slot at a time. The reported probe results (positive on 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells on LongBench, plus NeedleBench and RULER checks) are direct observations from these held-fixed runs across models and budgets. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or description. The central ordering (recover evidence, rank value, preserve coupled evidence) follows from the sign evaluations rather than reducing to inputs by construction. This is self-contained empirical work with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- cache budget sizes
- model selection
axioms (1)
- domain assumption: Holding the selector setup fixed while varying one decision slot isolates the contribution of that slot.
Reference graph
Works this paper leans on
- [1] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
- [2] Z. Cai, Y. Zhang, B. Gao, Y. Liu, T. Liu, K. Lu, W. Xiong, Y. Dong, B. Chang, J. Hu, and W. Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In Proceedings of the Second Conference on Language Modeling (COLM), 2025.
- [3] Z. Cai, W. Xiao, H. Sun, C. Luo, Y. Zhang, K. Wan, Y. Li, Y. Zhou, L.-W. Chang, J. Gu, Z. Dong, A. Anandkumar, A. Asi, and J. Hu. R-KV: Redundancy-aware KV cache compression for reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=2jwAjomEDB.
- [4]
- [5]
- [6]
- [7]
- [8] Y. An, C. Lu, K. Zhu, T. Yu, C. Zhao, H. Wu, M. Tang, and J. Wang. ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing. In The Fourteenth International Conference on Learning Representations, 2026.
- [9]
- [10] A. Bocharnikov, I. Ermakov, D. Kuznedelev, V. Zhdanovskiy, and Y. Yershov. KV Cache Offloading for Context-Intensive Tasks. arXiv preprint arXiv:2604.08426, 2026.
- [11]
- [12] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.
- [13] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/resolve/main/DeepSeek_V4.pdf.
- [14] A. Devoto, M. Jeblick, and S. Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025.
- [15]
- [16]
- [17] Y. Feng, J. Lv, Y. Cao, X. Xie, and S. K. Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=tcisuhGsQZ.
- [18]
- [19] S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. In International Conference on Learning Representations, 2024.
- [20]
- [21] X. Li, X. Jin, and L. Zhang. GraphKV: Breaking the static selection paradigm with graph-based KV cache eviction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21899–21909, Suzhou, China, 2025. Association for Computational Linguistics.
- [22] Coleman Richard Charles Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, S. Shao, K. Keutzer, and A. Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=0LXotew9Du.
- [23] K. Huang, H. Meng, J. Wu, J. Lu, C. Ma, Z. Chen, X. Wang, B. Ding, J. Wu, X. Wang, X. He, G. Wang, and J. Zhou. Beyond Magnitude: Leveraging Direction of RLVR Updates for LLM Reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=r6Pw3RiMYL.
- [24] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024. COLM 2024. URL https://arxiv.org/abs/2404.06654.
- [25] Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, and Kai Chen. NeedleBench: Evaluating LLM retrieval and reasoning across varying information densities. Transactions on Machine Learning Research, 2025. URL https://mlanthology.org/tmlr/2025/li2025tmlr-needlebench/.
- [26] Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. SnapKV: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems, volume 37, pages 22947–22970. Curran Associates, Inc., 2024.
- [27] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 32332–32344. PMLR, 2024.
- [28]
- [29]
- [30]
- [31] URL https://openreview.net/forum?id=tDRYrAkOB7.
- [32] M. Oren, M. Hassid, N. Yarden, Y. Adi, and R. Schwartz. Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741. Association for Computational Linguistics, 2024.
- [33] H. Tang, Y. Lin, J. Lin, Q. Han, D. Ke, S. Hong, Y. Yao, and G. Wang. RazorAttention: Efficient KV cache compression through retrieval heads. In The Thirteenth International Conference on Learning Representations, 2025.
- [34] R. Taniguchi, Y. Dong, M. Onizuka, and C. Xiao. Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference. arXiv preprint arXiv:2601.07667, 2026.
- [35] C.-C. Chang, W.-C. Lin, C.-Y. Lin, C.-Y. Chen, Y.-F. Hu, P.-S. Wang, N.-C. Huang, L. Ceze, M. S. Abdelfattah, and K.-C. Wu. Palu: KV-cache compression with low-rank projection. In The Thirteenth International Conference on Learning Representations, 2025.
- [36]
- [37] J.-H. Kim, J. Kim, S. Kwon, J. W. Lee, S. Yun, and H. O. Song. KVzip: Query-agnostic KV cache compression with context reconstruction. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [38] Z. Qin, Y. Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li. CAKE: Cascading and adaptive KV cache eviction with layer preferences. In The Thirteenth International Conference on Learning Representations, 2025.
- [39] J. Lei and S. Ilager. ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs. arXiv preprint arXiv:2603.08727, 2026.
- [40] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [41]
- [42] Z. Su, H. Zhang, W. Wu, Y. Zhang, Y. Liu, H. Xiao, Q. Yang, Y. Sun, R. Yang, C. Zhang, K. Fan, W. Ye, J. Xiong, H. Shen, C. Tao, T. Wu, Z. Wan, Y. Qian, Y. Xie, and N. Wong. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026.
- [43] J. Singh and D. Hakkani-Tür. Do LLMs Encode Functional Importance of Reasoning Tokens? arXiv preprint arXiv:2601.03066, 2026.
- [44] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [45] A. Q. Jiang et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [46]
- [47] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.
- [48] Z. Guo, H. Kamigaito, and T. Watanabe. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21158–21166, 2024.
- [49]
- [50] Y. Feng, H. Guo, J. Lv, S. Kevin Zhou, and X. Xie. DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference. In The Fourteenth International Conference on Learning Representations, 2026.
- [51] URL https://openreview.net/forum?id=nJgS06sX3O.
- [52] J. Ahn, I. Seong, A. Kedia, J. Kim, H. Jang, K. Lee, and Y. Jeon. LookaheadKV: Fast and accurate KV cache eviction by glimpsing into the future without generation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RVLMGPXt2i.
- [53] Y. Wang, S. Ji, Y. Liu, Y. Xu, Y. Xu, Q. Zhu, and W. Che. Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34146–34162, 2025.
- [54] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Re, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=RkRrPp7GKO.