
arxiv: 2605.18053 · v1 · pith:AHHS2V7Onew · submitted 2026-05-18 · 💻 cs.LG · cs.CL · cs.CR · cs.PF

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CR · cs.PF
keywords KV cache eviction, LLM long-context inference, attention scoring, cache management, transformer memory optimization

The pith

Structural protection at prompt boundaries dominates scoring policies in KV cache eviction for long-context LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies KV cache eviction policies under a shared globally capped decode-time harness across seven models and LongBench tasks. Without reserving space at prompt boundaries, policies such as LRU, H2O, SnapKV, and others drop to near-zero quality because they fail to retain structurally critical tokens. Adding protection for 10% of cache at each boundary recovers 69-90% of the full-cache reference quality even at 13% overall retention, and a wider ten-model panel reaches 68-98%. The work shows that once boundaries are guarded, differences between attention-based scorers become secondary, with protection itself carrying most of the performance lift.
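
The two headline quantities above can be reproduced with simple arithmetic. The sketch below assumes that "retention" means the capped cache size divided by the reference cache size (256 / 2048 = 12.5%, which rounds to the reported 13%) and that "recovery" means capped-cache F1 divided by reference-ceiling F1. Both definitions are inferred from the summary, not confirmed against the paper, and the function names and F1 values are hypothetical.

```python
def retention(cache_cap: int, reference_cap: int) -> float:
    """Fraction of the reference cache that the smaller cap keeps."""
    if reference_cap <= 0:
        raise ValueError("reference_cap must be positive")
    return cache_cap / reference_cap


def recovery(f1_capped: float, f1_reference: float) -> float:
    """Capped-cache quality as a fraction of reference-ceiling quality."""
    if f1_reference <= 0:
        raise ValueError("f1_reference must be positive")
    return f1_capped / f1_reference


# C=256 against the C=2048 reference cache.
r = retention(256, 2048)

# Hypothetical F1 values, chosen only to illustrate the ratio (not from the paper).
rec = recovery(0.36, 0.45)

print(f"retention = {r:.1%}, recovery = {rec:.0%}")
```

If the paper instead defines retention against the full context length, the first function would take the context length as its denominator; the recovery ratio is unaffected.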

Core claim

Under a shared globally capped decode-time harness, seven eviction policies share a prompt-boundary vulnerability that causes them to collapse without structural protection. Reserving 10% of cache at each boundary recovers 69-90% of the C=2048 reference-ceiling quality on seven LongBench models at C=256 (13% retention); a ten-model panel spans 68-98%. An attention-mass pilot shows the position-0 sink holds about 75% of prefix mass while other boundary tokens sit near 0.41 times uniform expectation. With protection, simplified score-isolation variants are TOST-equivalent to LRU at K=32, attention policies converge yet beat LRU at K=8, and faithful per-head scoring adds modest further gains.
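
To make the attention-mass statistics concrete, here is a minimal sketch of how a sink share and a "times uniform expectation" ratio could be computed from one attention row. It assumes the uniform expectation is 1/n over the whole prefix of n tokens; the paper may normalize differently (for example, over non-sink mass only). The function names and the toy attention row are hypothetical, and the toy row is constructed to reproduce the reported 75% and 0.41 figures rather than measured from any model.

```python
from typing import Sequence


def sink_share(attn: Sequence[float]) -> float:
    """Share of total attention mass that lands on position 0."""
    return attn[0] / sum(attn)


def ratio_to_uniform(attn: Sequence[float], positions: Sequence[int]) -> float:
    """Mean normalized mass on `positions`, divided by the uniform share 1/n."""
    n = len(attn)
    total = sum(attn)
    mean_mass = sum(attn[p] for p in positions) / (len(positions) * total)
    return mean_mass * n


# Toy row over a 100-token prefix (illustrative numbers, not measurements):
# position 0 is the sink, positions 1-4 stand in for other boundary tokens.
n = 100
boundary = [1, 2, 3, 4]
rest = (1.0 - 0.75 - 0.0041 * len(boundary)) / (n - 1 - len(boundary))
attn_row = [0.75] + [0.0041] * len(boundary) + [rest] * (n - 1 - len(boundary))

sink = sink_share(attn_row)
ratio = ratio_to_uniform(attn_row, boundary)
print(f"sink share = {sink:.2f}, boundary mass = {ratio:.2f}x uniform")
```

A ratio below 1 means a scorer that ranks tokens by attention mass would place these boundary tokens below an average token, which is consistent with the summary's account of why they get evicted.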

What carries the argument

Structural protection mechanism that reserves a fixed fraction of cache slots at prompt boundaries to safeguard the high-mass sink token and other boundary positions.
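
The mechanism can be sketched as a slot-selection routine. This is an illustration under stated assumptions, not the paper's implementation: it assumes the two "boundaries" are the start and end of the prompt, that the reservation is a global `int(protect_frac * cap)` slots at each boundary (not per head), and that the remaining budget goes to the highest-scoring tokens, with recency standing in for LRU. The function name `select_kept` and all parameters are hypothetical.

```python
from typing import Callable, List, Optional


def select_kept(
    num_tokens: int,
    prompt_len: int,
    cap: int,
    protect_frac: float = 0.10,
    score: Optional[Callable[[int], float]] = None,
) -> List[int]:
    """Return the sorted token positions kept under a global cap of `cap` slots."""
    if num_tokens <= cap:
        return list(range(num_tokens))

    # Reserve k slots at each prompt boundary; cap k so the two spans never overlap.
    k = min(int(protect_frac * cap), prompt_len // 2)
    protected = set(range(k)) | set(range(prompt_len - k, prompt_len))

    # Fill the remaining budget by score. Default: newer position = higher score,
    # a recency proxy for LRU (an assumption, not the paper's definition).
    if score is None:
        score = float
    budget = cap - len(protected)
    candidates = [p for p in range(num_tokens) if p not in protected]
    candidates.sort(key=score, reverse=True)

    return sorted(protected | set(candidates[:budget]))


# 900 prompt tokens plus 100 generated tokens, squeezed into C=256.
kept = select_kept(num_tokens=1000, prompt_len=900, cap=256)
unprotected = select_kept(num_tokens=1000, prompt_len=900, cap=256, protect_frac=0.0)

print(len(kept), kept[:3], 0 in unprotected)
```

With protection disabled, the recency proxy keeps only the most recent 256 positions, so every token at the start of the prompt is evicted. Passing a different `score` callable swaps in another ranking while leaving the reserved spans untouched, which is the separation between protection and scoring that the summary describes.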

If this is right

  • Simplified attention variants become statistically equivalent to LRU once protection is added at K=32.
  • Attention policies still outperform LRU by 0.011-0.021 F1 at K=8 across C=256 and C=512.
  • Faithful per-head versions of Ada-KV and QUEST add an extra 0.03-0.04 F1 on models such as Mistral-7B.
  • Protection lifts transfer with ratio 0.99-1.00 between decode and prefill regimes in the NIAH-32K pilot.
  • Per-head allocation supplies a further modest gain on top of boundary protection.
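
The equivalence claim in the first bullet rests on the two one-sided tests (TOST) procedure. The sketch below shows the general shape of a paired TOST on per-item F1 differences, assuming the reported Δ=0.02 is the equivalence margin. It uses a large-sample normal approximation from the standard library instead of a t distribution and applies no multiple-comparison correction, so it is a simplification and not the paper's exact test. The function name and the toy differences are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def tost_paired(
    diffs: Sequence[float], margin: float = 0.02, alpha: float = 0.05
) -> Tuple[float, bool]:
    """Paired TOST via a normal approximation.

    Returns (p_value, equivalent), where the p-value is the larger of the
    two one-sided p-values and `equivalent` means p_value < alpha.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired differences")
    m = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    if se == 0:
        # Degenerate case: all differences identical.
        return (0.0, True) if abs(m) < margin else (1.0, False)

    std_normal = NormalDist()
    # H0: mean <= -margin, rejected when the mean sits well above -margin.
    p_lower = 1.0 - std_normal.cdf((m + margin) / se)
    # H0: mean >= +margin, rejected when the mean sits well below +margin.
    p_upper = std_normal.cdf((m - margin) / se)

    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Toy per-item F1 differences (policy minus LRU); illustrative, not paper data.
p_eq, is_eq = tost_paired([0.01, -0.01] * 50)
p_ne, is_ne = tost_paired([0.05, 0.03] * 50)
print(is_eq, is_ne)
```

Equivalence is declared only when both one-sided nulls are rejected, so a mean difference of 0.04 fails regardless of how tight its spread is.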

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future eviction systems should treat boundary reservation as the default first step rather than an optional add-on.
  • The same protection ratio may need adjustment at contexts of 64K and beyond, where the reported recovery is modest.
  • Combining boundary protection with dynamic per-layer or per-head budgets could be tested to see if gains compound beyond the reported modest improvements.

Load-bearing premise

The six pure-transformer models and LongBench tasks used are representative of broader LLM usage patterns and the observed attention-mass distribution generalizes.

What would settle it

A new model or task where quality stays near zero at 13% retention even after applying the 10% boundary reservation would show the protection effect does not hold.

Figures

Figures reproduced from arXiv: 2605.18053 by Gabriel Garcia.

Figure 1: Per-item F1 distributions for LRU at C=256 (Qwen2.5-3B, N=162). Without protection (red), 96% of items score near zero. Protection (blue) shifts roughly half the distribution to substantive F1, with 15% of items recovering to ≥ 0.95. Of the 78 protected items still at F1 < 0.10, 83% also score zero under the full cache, indicating model inability rather than a protection failure.

Figure 2: Cache capacity vs. quality for protected (solid) and unprotected (dashed) policies on four …

Figure 3: Diminishing returns of protection. F1 vs. protection fraction (LRU at …

Figure 4: Per-domain protection effect at 11K context (Qwen2.5-3B, …

Figure 5: (a) Protection recovery (% of full-cache ceiling) across context lengths (1.9K, 11K, 32K, …

Figure 6: Position-dependent retrieval at 64K tokens (Qwen3-4B-Instruct-2507, 60 NIAH items, …
Original abstract

We study KV cache eviction under a shared globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1$\leq$0.064). Reserving 10\% of cache at each boundary recovers 69--90\% of the $C{=}2{,}048$ reference-ceiling quality on seven LongBench models at $C{=}256$ (13\% retention); a ten-model panel spans 68--98\%. An attention-mass pilot (Qwen2.5-3B, $N{=}30$) suggests why: the position-0 sink holds ${\sim}75\%$ of prefix mass, while other boundary tokens sit near ${\sim}0.41{\times}$ uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K{=}32$ ($\Delta{=}0.02$); at $K{=}8$, attention policies pairwise converge yet beat LRU by 0.011--0.021 F1 across $C{=}256$ and $C{=}512$. Faithful Ada-KV/QUEST add ${\sim}0.03$--$0.04$ F1 on Mistral-7B and Phi-3.5 beyond simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs.\ prefill, $C{\in}\{512,2048\}$) shows near-identical protection lifts (ratio 0.99--1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches full-cache ceiling on Gemma-3-4B at 6.3\% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines KV cache eviction under a globally capped decode-time harness across seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random). It reports that without structural protection these policies collapse to near-zero quality (F1 ≤ 0.064) on six pure-transformer models. Reserving 10% of cache at each boundary recovers 69–90% of the C=2048 reference quality at C=256 (13% retention) on seven LongBench models and 68–98% on a ten-model panel. An attention-mass pilot on Qwen2.5-3B (N=30) is used to explain the boundary vulnerability: the position-0 sink holds ~75% of prefix mass while other boundary tokens receive ~0.41× uniform attention, so scorers keep the sink but drop other critical tokens. With protection, simplified score-isolation variants become TOST-equivalent to LRU at K=32 and attention policies converge yet outperform LRU at K=8; faithful variants add modest gains. A NIAH-32K transfer pilot shows near-identical protection lifts.

Significance. If the empirical recovery percentages hold, the result is significant for efficient long-context inference: it shows that a lightweight structural safeguard can recover most performance lost to aggressive eviction, rendering many scoring refinements secondary once boundaries are protected. This offers a practical, low-overhead lever for memory-constrained deployment and shifts emphasis from per-token attention heuristics toward position-aware cache allocation.

major comments (1)
  1. [Abstract / attention-mass pilot] The mechanistic explanation that attention scorers 'retain the sink but still drop structurally critical tokens' rests on the attention-mass pilot (position-0 ~75% prefix mass, other boundaries ~0.41× uniform) reported only for Qwen2.5-3B (N=30). The main results and dominance claim span seven LongBench models (including Mistral-7B, Phi-3.5, Gemma-3-4B) and a ten-model panel, yet no equivalent mass distributions are provided for these models. If sink mass or boundary under-attention differs materially, the account of why protection recovers 69–90% while scoring differences remain secondary would not generalize uniformly. This is load-bearing for the paper's interpretation of the results.
minor comments (2)
  1. [Results and experimental protocol] Reported F1 scores, recovery percentages, and TOST equivalence claims lack error bars, standard deviations, or confidence intervals, and the abstract omits exact data-exclusion rules and full experimental protocol details needed for reproducibility.
  2. [Method / protection mechanism] The precise definition and implementation of '10% of cache at each boundary' (e.g., which tokens count as boundaries, whether reservation is global or per-head, and how it scales with C=256 vs. C=512) should be stated explicitly, ideally with pseudocode or a small diagram.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical significance of the structural protection result. We address the single major comment below and clarify the role of the attention-mass pilot relative to the empirical claims.

  1. Referee: [Abstract / attention-mass pilot] The mechanistic explanation that attention scorers 'retain the sink but still drop structurally critical tokens' rests on the attention-mass pilot (position-0 ~75% prefix mass, other boundaries ~0.41× uniform) reported only for Qwen2.5-3B (N=30). The main results and dominance claim span seven LongBench models (including Mistral-7B, Phi-3.5, Gemma-3-4B) and a ten-model panel, yet no equivalent mass distributions are provided for these models. If sink mass or boundary under-attention differs materially, the account of why protection recovers 69–90% while scoring differences remain secondary would not generalize uniformly. This is load-bearing for the paper's interpretation of the results.

    Authors: We agree that the attention-mass pilot is reported only for Qwen2.5-3B and that extending the distributional analysis would strengthen the mechanistic narrative. However, the core claims of the paper are empirical rather than mechanistic. Without structural protection, every scoring policy collapses to near-zero quality (F1 ≤ 0.064) on the six pure-transformer models, while reserving 10% of the cache at each boundary recovers 69–90% of the C=2048 reference quality at 13% retention on the seven LongBench models and 68–98% on the ten-model panel. These recovery percentages and the secondary role of scoring differences are observed directly and do not depend on the exact mass numbers from the pilot. The pilot serves only to supply intuition for why boundary tokens are vulnerable on the model where it was measured. We will add attention-mass distributions for at least two additional models (Mistral-7B and Phi-3.5) in the revised manuscript to test consistency of the sink and boundary-attention pattern. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations or self-referential fits


The manuscript contains no equations, derivations, or parameter-fitting steps that could reduce to self-definition or fitted-input-called-prediction. All reported results (F1 recoveries, TOST equivalence, attention-mass observations) are direct experimental measurements across fixed policies, models, and cache sizes. The Qwen2.5-3B pilot is presented as an explanatory observation rather than an input that is then redefined as an output; no quantity is constructed from itself. No self-citations are invoked to justify uniqueness or ansatzes. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on empirical observation rather than derivation; the 10% reservation ratio is a design parameter chosen to demonstrate the protection effect.

free parameters (1)
  • boundary reservation ratio
    Fixed at 10% to isolate the structural-protection effect; directly controls the reported recovery percentages.
axioms (1)
  • domain assumption: The six tested transformer models and LongBench tasks are representative of typical long-context usage.
    Invoked to support generalization of the protection benefit beyond the reported panel.

pith-pipeline@v0.9.0 · 5946 in / 1320 out tokens · 59032 ms · 2026-05-20T12:08:54.858635+00:00 · methodology


