pith. machine review for the scientific record.

arxiv: 2605.08840 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords KV cache eviction · long context modeling · LLM inference optimization · attention mechanisms · memory reduction · output reconstruction

The pith

ReST-KV improves KV cache eviction by minimizing layer-wise output discrepancies to account for attention redistribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReST-KV to handle the memory demands of KV caches in long-context LLM inference. Traditional methods keep high-attention tokens but fail to consider how eviction alters the attention distribution across the remaining tokens. ReST-KV instead casts eviction as an optimization that minimizes differences in model outputs, reconstructed layer by layer, so the effect of removing each token is modeled directly. It adds exponential moving average smoothing to handle changes over time and adaptive windows to capture spatial patterns in token importance. The approach yields higher accuracy on long-context benchmarks and much lower decoding latency.

Core claim

ReST-KV formulates the KV cache eviction task as an optimization problem that minimizes output discrepancies using efficient layer-wise reconstruction, thereby capturing attention redistribution effects, and enhances robustness with spatial-temporal smoothing mechanisms.
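Concretely, the paper's appendix defines a per-token eviction indicator as the reconstruction error of the attention output when one KV pair is removed. The form below paraphrases that definition; the budgeted selection step is written as one plausible reading for illustration, not the paper's exact stated objective.

```latex
% Eviction indicator (paraphrased from the paper's appendix): the change in the
% MHA output at step t caused by removing the n-th KV pair.
\[
  I_t[n] = \bigl\lVert \operatorname{MHA}\bigl(x_t, \langle K_T, V_T\rangle\bigr)
         - \operatorname{MHA}\bigl(x_t, \langle K_{T,\setminus n}, V_{T,\setminus n}\rangle\bigr) \bigr\rVert_2
\]
% One plausible reading of the eviction step under a cache budget B
% (an illustrative assumption, not the paper's stated formulation):
\[
  \mathcal{S}^{\star} = \arg\min_{\lvert\mathcal{S}\rvert = B} \sum_{n \notin \mathcal{S}} I_t[n]
\]
```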

What carries the argument

The layer-wise output reconstruction, which approximates the impact of each token removal on the model's output at each layer without full recomputation, combined with exponential moving average smoothing for temporal variations and adaptive window-based smoothing for spatial patterns.
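As a rough illustration of that machinery, here is a minimal single-layer, single-head sketch in PyTorch. It is not the authors' implementation: the naive leave-one-out loop computes exactly the quantity the paper approximates efficiently, and the tensor shapes, plain L2 norm, and fixed average-pooling window are editorial assumptions.

```python
# Minimal sketch, not the authors' algorithm: a naive single-layer, single-head
# reconstruction-style eviction score with EMA and window smoothing.
import torch
import torch.nn.functional as F

def reconstruction_scores(q, k, v):
    """Score each cached KV pair by how much the attention output at the
    current step changes when that pair is removed. q: (d,), k, v: (n, d)."""
    n, d = k.shape
    attn = torch.softmax(q @ k.T / d**0.5, dim=-1)            # (n,)
    full_out = attn @ v                                        # (d,)
    scores = torch.empty(n)
    for i in range(n):                                         # naive leave-one-out
        keep = torch.arange(n) != i
        attn_i = torch.softmax(q @ k[keep].T / d**0.5, dim=-1)
        scores[i] = torch.linalg.vector_norm(full_out - attn_i @ v[keep])
    return scores

def smooth(scores, prev_ema, alpha=0.9, window=5):
    """Temporal EMA plus a simple spatial average over a local window
    (the paper's adaptive window mechanism is not reproduced here)."""
    ema = alpha * prev_ema + (1 - alpha) * scores
    pooled = F.avg_pool1d(ema.view(1, 1, -1), window, stride=1,
                          padding=window // 2).view(-1)[: ema.numel()]
    return ema, pooled

# Usage sketch: keep the top-B tokens under a cache budget B.
# ema, smoothed = smooth(reconstruction_scores(q, k, v), prev_ema)
# keep_idx = smoothed.topk(budget).indices
```

A real implementation would maintain these scores incrementally per layer and head rather than recomputing attention n+1 times per step; the sketch only shows the shape of the computation.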

If this is right

  • It achieves 2.58% higher performance on LongBench and 15.2% on RULER compared to state-of-the-art baselines.
  • It outperforms prior methods on Needle-in-a-Haystack and InfiniteBench tests.
  • It provides a 10.61× reduction in decoding latency when handling 128k context lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This reconstruction-based approach might generalize to other memory optimization problems in neural networks beyond just KV caches.
  • The spatial and temporal smoothing components could be adapted for real-time applications where context changes dynamically.
  • If the method scales, it may enable practical deployment of LLMs with even longer contexts than currently feasible.

Load-bearing premise

Efficient layer-wise output reconstruction accurately reflects the complete effect of removing any token on the final model output without significant approximation errors that accumulate.

What would settle it

Running the full model with and without the evicted tokens and checking whether the actual output difference matches the reconstructed difference used for eviction decisions on a long input sequence.
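A minimal way to run that check, assuming a HuggingFace-style causal LM; the "drop the token from the input" stand-in for eviction and the Spearman rank-correlation metric are editorial choices, not the paper's protocol.

```python
# Sketch of the settling experiment: does the cheap reconstruction score track
# the exact output change from actually removing a token?
import torch
from scipy.stats import spearmanr

@torch.no_grad()
def exact_vs_reconstructed(model, input_ids, candidate_positions, recon_scores):
    """Compare precomputed reconstruction scores against the exact change in
    final logits when each candidate token is removed from the context."""
    full_logits = model(input_ids).logits[0, -1]
    exact_deltas = []
    for pos in candidate_positions:
        # crude stand-in for eviction: remove the token from the input entirely
        pruned = torch.cat([input_ids[:, :pos], input_ids[:, pos + 1:]], dim=1)
        pruned_logits = model(pruned).logits[0, -1]
        exact_deltas.append(torch.linalg.vector_norm(full_logits - pruned_logits).item())
    rho, _ = spearmanr(exact_deltas, [float(recon_scores[p]) for p in candidate_positions])
    return rho  # high rank correlation would support the load-bearing premise
```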

Figures

Figures reproduced from arXiv: 2605.08840 by Chang Lu, Chaoyang Zhao, Hong Wu, Jinqiao Wang, Kuan Zhu, Ming Tang, Tao Yu, Yongqi An.

Figure 1: Comparison between ReST-KV and existing methods. Unlike prior approaches that over…
Figure 2: Overview of ReST-KV. (a) Layer-wise output reconstruction quantifies each KV pair's…
Figure 3: Visualization analysis of the spatial-temporal dynamics of the output reconstruction in…
Figure 4: Average score across 16 datasets of LongBench under various cache budgets. ReST-KV…
Figure 5: Performance comparison on the Needle in a Haystack Test using Mistral-7B-Instruct…
Figure 6: Peak memory usage and decoding latency on NVIDIA A800 80GB GPU. ReST-KV re…
Figure 7: Sensitivity analysis of the smoothing factor
Figure 8: Performance comparison on the Needle in a Haystack Test using Mistral-7B-Instruct-v0.3…
Figure 9: Performance comparison on the Needle in a Haystack Test using Llama3.1-8B-Instruct…
Figure 10: Performance comparison on the Needle in a Haystack Test using Llama3.1-8B-Instruct…
Figure 11: Comparison of ReST-KV, KV cache quantization methods (KIVI and KVQuant), and…
Original abstract

Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection. In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output Reconstruction and Spatial-Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights. To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window-based mechanism to capture spatial patterns. Our method, ReST-KV, significantly advances performance on long-context benchmarks. It surpasses state-of-the-art baselines by 2.58% on LongBench and 15.2% on RULER. Additionally, ReST-KV consistently outperforms existing methods on Needle-in-a-Haystack and InfiniteBench, all while achieving a remarkable 10.61$\times$ reduction in decoding latency at 128k context length. The code is publicly available at https://github.com/an-yongqi/rest-kv to facilitate reproducibility and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes ReST-KV, a KV cache eviction method for LLMs that formulates the task as an optimization problem minimizing output discrepancies via efficient layer-wise output reconstruction. This is augmented with exponential moving average smoothing for temporal variations and an adaptive window mechanism for spatial patterns. The approach claims to capture attention redistribution effects beyond raw attention weights, yielding 2.58% gains over SOTA on LongBench, 15.2% on RULER, consistent improvements on Needle-in-a-Haystack and InfiniteBench, and a 10.61× decoding latency reduction at 128k context length, with public code release.

Significance. If the layer-wise reconstruction accurately models downstream effects without substantial approximation error accumulation, the method provides a more principled eviction criterion than attention-weight heuristics and could improve robustness in long-context inference. The concrete benchmark numbers, latency results, and public code are positive for reproducibility and practical utility.

major comments (1)
  1. §3 (Method, layer-wise reconstruction): the central claim that efficient per-layer output reconstruction captures the full downstream impact of token removal (including attention redistribution) without expensive recomputation lacks any reported validation against exact full forward passes, error bounds, or accumulation analysis across layers. This is load-bearing because unquantified approximation errors can alter eviction decisions in deep transformers, directly affecting the reported benchmark gains.
minor comments (3)
  1. Results section: the reported improvements lack details on statistical significance testing, exact baseline implementations, and hyperparameter search ranges (including the EMA factor and adaptive window size).
  2. Experiments: an ablation isolating the layer-wise reconstruction component from the spatial-temporal smoothing would strengthen the contribution analysis.
  3. Notation: the optimization objective and smoothing parameters should be formalized with explicit equations and variable definitions for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and will revise the paper accordingly to strengthen the validation of our method.

point-by-point responses
  1. Referee: §3 (Method, layer-wise reconstruction): the central claim that efficient per-layer output reconstruction captures the full downstream impact of token removal (including attention redistribution) without expensive recomputation lacks any reported validation against exact full forward passes, error bounds, or accumulation analysis across layers. This is load-bearing because unquantified approximation errors can alter eviction decisions in deep transformers, directly affecting the reported benchmark gains.

    Authors: We acknowledge the validity of this observation. The manuscript currently does not report direct comparisons of the layer-wise reconstruction to exact full forward passes or provide quantitative error bounds and accumulation analysis. To address this, we will add a new analysis subsection in §3. This will include experiments on representative models and contexts where we compute both the approximated reconstruction and the exact output after token removal, reporting metrics such as output discrepancy (e.g., KL divergence or L2 norm on logits), and analyzing how errors propagate across layers. We will also discuss the impact on eviction decisions and benchmark performance. This addition will substantiate the claim and enhance the paper's rigor. revision: yes
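One concrete shape the promised layer-propagation analysis could take; the hidden-state hook (output_hidden_states=True) and the KL-on-logits metric are illustrative choices, not the authors' stated protocol.

```python
# Sketch: measure how the discrepancy between a full context and a pruned
# context grows layer by layer, ending with KL divergence on the final logits.
# Assumes a HuggingFace-style model that can return per-layer hidden states.
import torch
import torch.nn.functional as F

@torch.no_grad()
def layerwise_drift(model, full_ids, pruned_ids):
    full = model(full_ids, output_hidden_states=True)
    pruned = model(pruned_ids, output_hidden_states=True)
    per_layer = [
        torch.linalg.vector_norm(hf[0, -1] - hp[0, -1]).item()
        for hf, hp in zip(full.hidden_states, pruned.hidden_states)
    ]
    kl = F.kl_div(
        F.log_softmax(pruned.logits[0, -1], dim=-1),   # approximate distribution
        F.softmax(full.logits[0, -1], dim=-1),         # reference distribution
        reduction="sum",
    ).item()
    return per_layer, kl  # per-layer drift and its effect on the output
```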

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper formulates KV eviction as an optimization problem solved via layer-wise reconstruction plus EMA/adaptive smoothing, then reports empirical gains on external benchmarks (LongBench +2.58%, RULER +15.2%, Needle-in-a-Haystack, InfiniteBench). These results are obtained by running the algorithm on held-out test sets rather than by algebraic reduction to the same fitted quantities or self-citations. No equations are shown that equate a claimed prediction to its own input by construction, no parameter is fitted on a subset and then relabeled a prediction, and no load-bearing uniqueness theorem or ansatz is imported from prior self-work. The reconstruction step is an algorithmic design choice whose fidelity is an empirical question, not a definitional tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard transformer assumptions plus new modeling choices for reconstruction and smoothing; no new physical entities are postulated.

free parameters (2)
  • exponential moving average factor
    Controls temporal smoothing strength and is expected to be chosen or tuned per model or task.
  • adaptive window size
    Determines the spatial neighborhood for pattern capture and is likely a hyperparameter; a standard form for both parameters is sketched after this ledger.
axioms (1)
  • domain assumption: Layer-wise reconstruction provides a sufficient proxy for the global output change induced by KV eviction
    Invoked to turn eviction into an optimizable objective without full forward passes.
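For concreteness, a standard form these two free parameters usually take; the exact update rule and the adaptive-window mechanism in the paper may differ.

```latex
% Temporal smoothing (standard EMA; \alpha is the free smoothing factor):
\[
  \tilde{I}_t[n] = \alpha\, \tilde{I}_{t-1}[n] + (1-\alpha)\, I_t[n]
\]
% Spatial smoothing over a local window of size w (a plain average here;
% the paper's adaptive window would choose w and its offset per position):
\[
  \bar{I}_t[n] = \frac{1}{w} \sum_{j=-\lfloor w/2\rfloor}^{\lfloor w/2\rfloor} \tilde{I}_t[n+j]
\]
```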

pith-pipeline@v0.9.0 · 5595 in / 1356 out tokens · 43870 ms · 2026-05-12T02:48:09.166708+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
