pith. machine review for the scientific record.

arxiv: 2605.08840 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords KV cache eviction · long context modeling · LLM inference optimization · attention mechanisms · memory reduction · output reconstruction

The pith

ReST-KV improves KV cache eviction by minimizing layer-wise output discrepancies to account for attention redistribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReST-KV to handle the memory demands of KV caches in long-context LLM inference. Traditional methods keep high-attention tokens but fail to consider how eviction alters the attention distribution across the remaining tokens. ReST-KV instead casts eviction as an optimization that minimizes differences in model outputs, reconstructed layer by layer, so the effect of removing each token is modeled directly. It adds exponential moving average smoothing to handle changes over time and adaptive windows to capture spatial patterns in token importance. The approach yields higher accuracy on long-context benchmarks and much lower decoding latency.

Core claim

ReST-KV formulates the KV cache eviction task as an optimization problem that minimizes output discrepancies using efficient layer-wise reconstruction, thereby capturing attention redistribution effects, and enhances robustness with spatial-temporal smoothing mechanisms.
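Concretely, the paper's appendix defines a per-token eviction indicator as the reconstruction error of the attention output when one KV pair is removed. The form below paraphrases that definition; the budgeted selection step is written as one plausible reading for illustration, not the paper's exact stated objective.

```latex
% Eviction indicator (paraphrased from the paper's appendix): the change in the
% MHA output at step t caused by removing the n-th KV pair.
\[
  I_t[n] = \bigl\lVert \operatorname{MHA}\bigl(x_t, \langle K_T, V_T\rangle\bigr)
         - \operatorname{MHA}\bigl(x_t, \langle K_{T,\setminus n}, V_{T,\setminus n}\rangle\bigr) \bigr\rVert_2
\]
% One plausible reading of the eviction step under a cache budget B
% (an illustrative assumption, not the paper's stated formulation):
\[
  \mathcal{S}^{\star} = \arg\min_{\lvert\mathcal{S}\rvert = B} \sum_{n \notin \mathcal{S}} I_t[n]
\]
```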

What carries the argument

The layer-wise output reconstruction, which approximates the impact of each token removal on the model's output at each layer without full recomputation, combined with exponential moving average smoothing for temporal variations and adaptive window-based smoothing for spatial patterns.
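As a rough illustration of that machinery, here is a minimal single-layer, single-head sketch in PyTorch. It is not the authors' implementation: the naive leave-one-out loop computes exactly the quantity the paper approximates efficiently, and the tensor shapes, plain L2 norm, and fixed average-pooling window are editorial assumptions.

```python
# Minimal sketch, not the authors' algorithm: a naive single-layer, single-head
# reconstruction-style eviction score with EMA and window smoothing.
import torch
import torch.nn.functional as F

def reconstruction_scores(q, k, v):
    """Score each cached KV pair by how much the attention output at the
    current step changes when that pair is removed. q: (d,), k, v: (n, d)."""
    n, d = k.shape
    attn = torch.softmax(q @ k.T / d**0.5, dim=-1)            # (n,)
    full_out = attn @ v                                        # (d,)
    scores = torch.empty(n)
    for i in range(n):                                         # naive leave-one-out
        keep = torch.arange(n) != i
        attn_i = torch.softmax(q @ k[keep].T / d**0.5, dim=-1)
        scores[i] = torch.linalg.vector_norm(full_out - attn_i @ v[keep])
    return scores

def smooth(scores, prev_ema, alpha=0.9, window=5):
    """Temporal EMA plus a simple spatial average over a local window
    (the paper's adaptive window mechanism is not reproduced here)."""
    ema = alpha * prev_ema + (1 - alpha) * scores
    pooled = F.avg_pool1d(ema.view(1, 1, -1), window, stride=1,
                          padding=window // 2).view(-1)[: ema.numel()]
    return ema, pooled

# Usage sketch: keep the top-B tokens under a cache budget B.
# ema, smoothed = smooth(reconstruction_scores(q, k, v), prev_ema)
# keep_idx = smoothed.topk(budget).indices
```

A real implementation would maintain these scores incrementally per layer and head rather than recomputing attention n+1 times per step; the sketch only shows the shape of the computation.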

If this is right

  • It achieves 2.58% higher performance on LongBench and 15.2% on RULER compared to state-of-the-art baselines.
  • It outperforms prior methods on Needle-in-a-Haystack and InfiniteBench tests.
  • It provides a 10.61× reduction in decoding latency when handling 128k context lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This reconstruction-based approach might generalize to other memory optimization problems in neural networks beyond just KV caches.
  • The spatial and temporal smoothing components could be adapted for real-time applications where context changes dynamically.
  • If the method scales, it may enable practical deployment of LLMs with even longer contexts than currently feasible.

Load-bearing premise

Efficient layer-wise output reconstruction accurately reflects the complete effect of removing any token on the final model output without significant approximation errors that accumulate.

What would settle it

Running the full model with and without the evicted tokens and checking whether the actual output difference matches the reconstructed difference used for eviction decisions on a long input sequence.
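A minimal way to run that check, assuming a HuggingFace-style causal LM; the "drop the token from the input" stand-in for eviction and the Spearman rank-correlation metric are editorial choices, not the paper's protocol.

```python
# Sketch of the settling experiment: does the cheap reconstruction score track
# the exact output change from actually removing a token?
import torch
from scipy.stats import spearmanr

@torch.no_grad()
def exact_vs_reconstructed(model, input_ids, candidate_positions, recon_scores):
    """Compare precomputed reconstruction scores against the exact change in
    final logits when each candidate token is removed from the context."""
    full_logits = model(input_ids).logits[0, -1]
    exact_deltas = []
    for pos in candidate_positions:
        # crude stand-in for eviction: remove the token from the input entirely
        pruned = torch.cat([input_ids[:, :pos], input_ids[:, pos + 1:]], dim=1)
        pruned_logits = model(pruned).logits[0, -1]
        exact_deltas.append(torch.linalg.vector_norm(full_logits - pruned_logits).item())
    rho, _ = spearmanr(exact_deltas, [float(recon_scores[p]) for p in candidate_positions])
    return rho  # high rank correlation would support the load-bearing premise
```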

Figures

Figures reproduced from arXiv: 2605.08840 by Chang Lu, Chaoyang Zhao, Hong Wu, Jinqiao Wang, Kuan Zhu, Ming Tang, Tao Yu, Yongqi An.

Figure 1: Comparison between ReST-KV and existing methods. Unlike prior approaches that over…
Figure 2: Overview of ReST-KV. (a) Layer-wise output reconstruction quantifies each KV pair's…
Figure 3: Visualization analysis of the spatial-temporal dynamics of the output reconstruction in…
Figure 4: Average score across 16 datasets of LongBench under various cache budgets. ReST-KV…
Figure 5: Performance comparison on the Needle in a Haystack Test using Mistral-7B-Instruct…
Figure 6: Peak memory usage and decoding latency on NVIDIA A800 80GB GPU. ReST-KV re…
Figure 7: Sensitivity analysis of the smoothing factor
Figure 8: Performance comparison on the Needle in a Haystack Test using Mistral-7B-Instruct-v0.3…
Figure 9: Performance comparison on the Needle in a Haystack Test using Llama3.1-8B-Instruct…
Figure 10: Performance comparison on the Needle in a Haystack Test using Llama3.1-8B-Instruct…
Figure 11: Comparison of ReST-KV, KV cache quantization methods (KIVI and KVQuant), and…
Original abstract

Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection. In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output Reconstruction and Spatial-Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights. To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window-based mechanism to capture spatial patterns. Our method, ReST-KV, significantly advances performance on long-context benchmarks. It surpasses state-of-the-art baselines by 2.58% on LongBench and 15.2% on RULER. Additionally, ReST-KV consistently outperforms existing methods on Needle-in-a-Haystack and InfiniteBench, all while achieving a remarkable 10.61$\times$ reduction in decoding latency at 128k context length. The code is publicly available at https://github.com/an-yongqi/rest-kv to facilitate reproducibility and further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes ReST-KV, a KV cache eviction method for LLMs that formulates the task as an optimization problem minimizing output discrepancies via efficient layer-wise output reconstruction. This is augmented with exponential moving average smoothing for temporal variations and an adaptive window mechanism for spatial patterns. The approach claims to capture attention redistribution effects beyond raw attention weights, yielding 2.58% gains over SOTA on LongBench, 15.2% on RULER, consistent improvements on Needle-in-a-Haystack and InfiniteBench, and a 10.61× decoding latency reduction at 128k context length, with public code release.

Significance. If the layer-wise reconstruction accurately models downstream effects without substantial approximation error accumulation, the method provides a more principled eviction criterion than attention-weight heuristics and could improve robustness in long-context inference. The concrete benchmark numbers, latency results, and public code are positive for reproducibility and practical utility.

major comments (1)
  1. §3 (Method, layer-wise reconstruction): the central claim that efficient per-layer output reconstruction captures the full downstream impact of token removal (including attention redistribution) without expensive recomputation lacks any reported validation against exact full forward passes, error bounds, or accumulation analysis across layers. This is load-bearing because unquantified approximation errors can alter eviction decisions in deep transformers, directly affecting the reported benchmark gains.
minor comments (3)
  1. Results section: the reported improvements lack details on statistical significance testing, exact baseline implementations, and hyperparameter search ranges (including the EMA factor and adaptive window size).
  2. Experiments: an ablation isolating the layer-wise reconstruction component from the spatial-temporal smoothing would strengthen the contribution analysis.
  3. Notation: the optimization objective and smoothing parameters should be formalized with explicit equations and variable definitions for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and will revise the paper accordingly to strengthen the validation of our method.

point-by-point responses
  1. Referee: §3 (Method, layer-wise reconstruction): the central claim that efficient per-layer output reconstruction captures the full downstream impact of token removal (including attention redistribution) without expensive recomputation lacks any reported validation against exact full forward passes, error bounds, or accumulation analysis across layers. This is load-bearing because unquantified approximation errors can alter eviction decisions in deep transformers, directly affecting the reported benchmark gains.

    Authors: We acknowledge the validity of this observation. The manuscript currently does not report direct comparisons of the layer-wise reconstruction to exact full forward passes or provide quantitative error bounds and accumulation analysis. To address this, we will add a new analysis subsection in §3. This will include experiments on representative models and contexts where we compute both the approximated reconstruction and the exact output after token removal, reporting metrics such as output discrepancy (e.g., KL divergence or L2 norm on logits), and analyzing how errors propagate across layers. We will also discuss the impact on eviction decisions and benchmark performance. This addition will substantiate the claim and enhance the paper's rigor. revision: yes
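One concrete shape the promised layer-propagation analysis could take; the hidden-state hook (output_hidden_states=True) and the KL-on-logits metric are illustrative choices, not the authors' stated protocol.

```python
# Sketch: measure how the discrepancy between a full context and a pruned
# context grows layer by layer, ending with KL divergence on the final logits.
# Assumes a HuggingFace-style model that can return per-layer hidden states.
import torch
import torch.nn.functional as F

@torch.no_grad()
def layerwise_drift(model, full_ids, pruned_ids):
    full = model(full_ids, output_hidden_states=True)
    pruned = model(pruned_ids, output_hidden_states=True)
    per_layer = [
        torch.linalg.vector_norm(hf[0, -1] - hp[0, -1]).item()
        for hf, hp in zip(full.hidden_states, pruned.hidden_states)
    ]
    kl = F.kl_div(
        F.log_softmax(pruned.logits[0, -1], dim=-1),   # approximate distribution
        F.softmax(full.logits[0, -1], dim=-1),         # reference distribution
        reduction="sum",
    ).item()
    return per_layer, kl  # per-layer drift and its effect on the output
```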

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper formulates KV eviction as an optimization problem solved via layer-wise reconstruction plus EMA/adaptive smoothing, then reports empirical gains on external benchmarks (LongBench +2.58%, RULER +15.2%, Needle-in-a-Haystack, InfiniteBench). These results are obtained by running the algorithm on held-out test sets rather than by algebraic reduction to the same fitted quantities or self-citations. No equations are shown that equate a claimed prediction to its own input by construction, no parameter is fitted on a subset and then relabeled a prediction, and no load-bearing uniqueness theorem or ansatz is imported from prior self-work. The reconstruction step is an algorithmic design choice whose fidelity is an empirical question, not a definitional tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard transformer assumptions plus new modeling choices for reconstruction and smoothing; no new physical entities are postulated.

free parameters (2)
  • exponential moving average factor
    Controls temporal smoothing strength and is expected to be chosen or tuned per model or task.
  • adaptive window size
    Determines the spatial neighborhood for pattern capture and is likely a hyperparameter; a standard form for both parameters is sketched after this ledger.
axioms (1)
  • domain assumption: Layer-wise reconstruction provides a sufficient proxy for the global output change induced by KV eviction
    Invoked to turn eviction into an optimizable objective without full forward passes.
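For concreteness, a standard form these two free parameters usually take; the exact update rule and the adaptive-window mechanism in the paper may differ.

```latex
% Temporal smoothing (standard EMA; \alpha is the free smoothing factor):
\[
  \tilde{I}_t[n] = \alpha\, \tilde{I}_{t-1}[n] + (1-\alpha)\, I_t[n]
\]
% Spatial smoothing over a local window of size w (a plain average here;
% the paper's adaptive window would choose w and its offset per position):
\[
  \bar{I}_t[n] = \frac{1}{w} \sum_{j=-\lfloor w/2\rfloor}^{\lfloor w/2\rfloor} \tilde{I}_t[n+j]
\]
```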

pith-pipeline@v0.9.0 · 5595 in / 1356 out tokens · 43870 ms · 2026-05-12T02:48:09.166708+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
