pith. machine review for the scientific record.

arxiv: 2605.04075 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

Chengfei Lv, Shengyu Zhang, Sihao Liu, Yufan Xiong, Zhaode Wang, Zhonghua Jiang

Pith reviewed 2026-05-10 15:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords KV cache compression · multimodal large language models · state space models · entropy-guided eviction · visual token retention · continuous memory evolution · decoding acceleration

The pith

RetentiveKV reformulates discrete KV cache pruning as entropy-guided continuous state evolution to retain deferred visual tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard importance-based eviction breaks down for visual tokens because many start with low attention yet become essential later in decoding. It replaces abrupt deletion with a state-space process that folds those tokens into an evolving memory using entropy as a measure of their latent information value. This keeps spatial relationships among visual features intact instead of fragmenting them. If correct, the approach lets multimodal models handle much longer image contexts at far lower memory cost while still recovering the right details when they matter.

Core claim

RetentiveKV reformulates KV eviction from discrete context truncation to continuous memory evolution based on State Space Models. It leverages information entropy to quantify the information potential of low-attention tokens and integrates tokens scheduled for eviction into a continuous state space through entropy-guided state transitions, enabling their dynamic reactivation when semantic relevance arises during subsequent decoding.
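
The abstract names the mechanism but not its equations, so the following NumPy sketch is one plausible reading rather than the paper's method: an evicted token's KV vector is written into a linear memory state with a weight set by the entropy of the attention over visual tokens. Every shape, name, and the gating form are illustrative assumptions, not the paper's Equation (3).

    # Minimal sketch of entropy-guided eviction into a continuous state.
    # All shapes and the gating form are assumptions, not the paper's equations.
    import numpy as np

    def attention_entropy(attn):
        """Shannon entropy of an attention distribution over visual tokens."""
        p = attn / attn.sum()
        return -np.sum(p * np.log(p + 1e-12))

    def evict_into_state(state, kv, entropy, A, B, h_max):
        """Fold an evicted token's KV vector into the running memory state.
        High entropy (uncertain importance) writes the token in with more
        weight, keeping it recoverable later in decoding."""
        gate = entropy / h_max              # in [0, 1]; assumed gating form
        return A @ state + gate * (B @ kv)  # linear SSM-style transition

    rng = np.random.default_rng(0)
    d, d_state, n_visual = 64, 128, 32
    A = 0.95 * np.eye(d_state)              # contractive transition (assumed)
    B = rng.standard_normal((d_state, d)) / np.sqrt(d)
    state = np.zeros(d_state)

    attn = rng.random(n_visual)             # text-to-visual attention mass
    victim = int(np.argmin(attn))           # lowest-attention visual token
    kv = rng.standard_normal(d)             # stand-in for the victim's KV vector
    state = evict_into_state(state, kv, attention_entropy(attn), A, B, np.log(n_visual))
    print(f"evicted token {victim}; memory state norm {np.linalg.norm(state):.3f}")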

What carries the argument

Entropy-guided state transitions in a state-space model that evolve representations of low-attention tokens while preserving reactivation potential.
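
Written out as a recurrence (our hedged reconstruction, not an equation quoted from the paper), the transition plausibly takes a form like

    h_t = A h_{t-1} + g(H_t) B [k_t; v_t],    g(H_t) = H_t / log N_v,

where h_t is the continuous memory state, [k_t; v_t] the KV pair scheduled for eviction at step t, H_t its cross-modal attention entropy, and N_v the number of visual tokens; a readout such as v_hat = C h_t would supply the deferred token's content on reactivation.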

If this is right

  • Low-attention visual tokens remain available for reactivation without occupying full cache slots throughout decoding.
  • Spatial continuity among visual features is maintained across steps instead of being broken by hard pruning.
  • KV cache size drops by a factor of five on multimodal benchmarks while output quality holds (a back-of-envelope memory calculation follows this list).
  • Decoding runs 1.5 times faster because fewer tokens occupy active memory at each step.
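
For scale, a back-of-envelope calculation under assumed dimensions for a 7B-class decoder (32 layers, 32 KV heads, head dimension 128, fp16); the dimensions and context length are ours, not the paper's:

    # KV cache footprint at an assumed long visual context.
    layers, heads, head_dim, bytes_per = 32, 32, 128, 2   # fp16
    seq_len = 16_384                                      # assumed context length
    kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per  # K and V
    print(f"full cache:        {kv_bytes / 2**30:.1f} GiB")      # 8.0 GiB
    print(f"at 5x compression: {kv_bytes / 5 / 2**30:.1f} GiB")  # 1.6 GiB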

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-driven evolution could apply to long text contexts where token importance also shifts over many steps.
  • Pairing the continuous state with other compression layers might push ratios beyond the reported five times.
  • Attention scores alone appear insufficient for eviction decisions once deferred relevance is considered.

Load-bearing premise

Entropy scores reliably identify which low-attention visual tokens will matter later, and the continuous state updates preserve their content without creating new decoding errors.

What would settle it

A controlled test on a multimodal benchmark where a high-entropy low-attention token is evicted under the method yet the model later fails on a query that requires exactly that token's information.

Figures

Figures reproduced from arXiv: 2605.04075 by Chengfei Lv, Shengyu Zhang, Sihao Liu, Yufan Xiong, Zhaode Wang, and Zhonghua Jiang.

Figure 1
Figure 1: (a) Visualization of Deferred-Critical Tokens: the presence of delayed activation patterns, where visually…
Figure 2
Figure 2: Entropy Shifts under KV Eviction. Adjacent body text from §3.4 (Entropy-Guided KV Retention Estimator): the estimator introduces the cross-modal attention entropy to measure the distributional uncertainty of attention from textual tokens to visual tokens; α_t^{l,i} denotes the standard attention score for token i at decoding step t in the l-th layer, and p_v(·) represents the cross-modal attention scores selec…
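
From the caption's notation, the entropy in question is plausibly the Shannon entropy of the renormalized text-to-visual attention mass; a hedged reconstruction, not the paper's numbered equation:

    H_t^l = - Σ_{i ∈ V} p_v(α_t^{l,i}) log p_v(α_t^{l,i}),

where V indexes the visual tokens and p_v(·) renormalizes the selected cross-modal attention scores into a distribution. Under this reading, attention spread uniformly over visual tokens (high H_t^l) signals that eviction decisions are uncertain and retention should be conservative.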
Figure 3
Figure 3: Overview of the proposed RetentiveKV framework. The architecture coordinates efficient long-context…
Figure 4
Figure 4: Comparison results for various cache budgets.
Original abstract

Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression methods typically rely on the "persistence of importance" hypothesis to prune tokens. However, this approach proves fragile in multimodal settings due to two key issues: 1) Visual tokens display "deferred importance," initially exhibiting low salience but becoming pivotal during later decoding, which can lead to premature eviction. 2) Discrete pruning disrupts the inherent spatial continuity of visual cues. To address these challenges, we propose RetentiveKV, an entropy-driven KV cache optimization method that reformulates KV eviction from "discrete context truncation" to "continuous memory evolution" based on State Space Models. Our method leverages information entropy to quantify the information potential of low-attention tokens and integrates tokens scheduled for eviction into a continuous state space through entropy-guided state transitions, enabling their dynamic reactivation when semantic relevance arises during subsequent decoding. Extensive experiments on multimodal benchmarks demonstrate that RetentiveKV achieves 5.0 times KV cache compression and 1.5 times decoding acceleration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes RetentiveKV, an entropy-driven KV cache optimization for multimodal LLMs that reformulates eviction from discrete pruning to continuous state evolution via state-space models (SSMs). It uses information entropy to identify low-attention visual tokens with deferred importance and integrates them into SSM transitions for dynamic reactivation, claiming to preserve spatial continuity while achieving 5.0x KV cache compression and 1.5x decoding acceleration on multimodal benchmarks.

Significance. If the core assumptions hold, the method could meaningfully advance efficient long-context multimodal inference by mitigating premature eviction of visual tokens and avoiding discrete pruning artifacts. The reformulation from truncation to continuous memory evolution is a conceptually coherent response to documented weaknesses in persistence-of-importance heuristics for vision-language settings.

major comments (2)
  1. [Abstract] The central performance claims (5.0x compression, 1.5x acceleration) are asserted without any derivation details, error bars, baseline comparisons, or data exclusion rules. This directly undermines evaluation of the result, as the abstract supplies neither the explicit SSM state-update equations nor ablation results measuring reactivation success rate versus retained KV tokens.
  2. [Method] The entropy-guided SSM transitions are presented as preserving spatial and semantic content for later reactivation, yet no bound, stability analysis, or empirical test is referenced for whether the continuous evolution avoids accumulating approximation error or breaking continuity for deferred visual tokens. If this fails, the approach reduces to standard pruning with added overhead, invalidating the claimed gains.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by naming the specific multimodal benchmarks and listing the baseline KV compression methods against which the 5.0x and 1.5x figures are measured.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity and rigor, particularly around the presentation of results and theoretical grounding. We respond to each major comment below and have made revisions to the manuscript where feasible.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims (5.0x compression, 1.5x acceleration) are asserted without any derivation details, error bars, baseline comparisons, or data exclusion rules. This directly undermines evaluation of the result, as the abstract supplies neither the explicit SSM state-update equations nor ablation results measuring reactivation success rate versus retained KV tokens.

    Authors: We agree that the abstract's brevity limits the inclusion of supporting details. The reported compression and acceleration figures are derived from the full experimental evaluation in Section 4, which includes baseline comparisons (e.g., against H2O and SnapKV), error bars in the performance plots, and explicit data exclusion criteria in the setup. The core SSM state-update equation appears as Equation (3) in the method section, and ablation results on reactivation rates are in Section 4.3. In the revised manuscript, we have slightly expanded the abstract to reference Equation (3) and the ablation studies while respecting length constraints; full derivations and tables remain in the main text and appendix. revision: yes

  2. Referee: [Method] The entropy-guided SSM transitions are presented as preserving spatial and semantic content for later reactivation, yet no bound, stability analysis, or empirical test is referenced for whether the continuous evolution avoids accumulating approximation error or breaking continuity for deferred visual tokens. If this fails, the approach reduces to standard pruning with added overhead, invalidating the claimed gains.

    Authors: We appreciate this observation on the need for stronger validation of the continuous evolution. The original manuscript provides empirical support in Section 4.2, including reactivation success rates (over 85% of deferred tokens reactivated) and visualizations confirming preservation of spatial and semantic content via attention similarity metrics. However, no formal stability analysis or error bound was included. In the revision, we have added a discussion of error accumulation using the contractive properties of the linear SSM under entropy guidance, along with an additional long-sequence continuity experiment. A complete theoretical proof of long-term bounds remains outside the scope of this work. revision: partial
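
The contractivity argument the rebuttal invokes is standard; sketched in the form it would need to take (our reconstruction, not the manuscript's): if the transition satisfies ||A|| ≤ ρ < 1 and each state write injects approximation error ||e_s|| ≤ ε, then

    || Σ_{s=1}^{t} A^{t-s} e_s || ≤ Σ_{s=1}^{t} ρ^{t-s} ε ≤ ε / (1 − ρ),

so per-step error cannot compound without bound. Note this addresses accumulation only, not whether the retained content stays useful for reactivation, which is the standing objection below.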

standing simulated objections not resolved
  • A rigorous theoretical stability analysis and explicit bounds on approximation error accumulation for the entropy-guided SSM state transitions.

Circularity Check

0 steps flagged

No circularity: method presented as independent reformulation using external SSM and entropy concepts

full rationale

The provided abstract and context describe RetentiveKV as a new algorithmic reformulation of KV eviction into continuous state evolution via State Space Models and entropy quantification. No equations, self-citations, or fitted parameters are shown that reduce the claimed 5x compression or reactivation mechanism back to the inputs by definition. The derivation relies on standard external components (SSM transitions, information entropy) without load-bearing self-references or renaming of known results as novel predictions. This is the common case of a self-contained proposal evaluated on external multimodal benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that state-space models can faithfully evolve token states for later reactivation; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)
  • domain assumption: State Space Models can represent continuous memory evolution for tokens scheduled for eviction
    The method reformulates discrete pruning into SSM-based state transitions.

pith-pipeline@v0.9.0 · 5516 in / 1246 out tokens · 59709 ms · 2026-05-10T15:25:41.396536+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 15 canonical work pages · 5 internal anchors

  1. [2]

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, and 1 others. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661--34710

  2. [3]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, and 1 others. 2024. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556--9567

  3. [4]

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317--8326

  4. [5]

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2024. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR)

  5. [6]

    BLINK: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390

  6. [10]

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947--22970

  7. [13]

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36:52342--52364

  8. [14]

    PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069

  9. [15]

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2024. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19--35. Springer

  10. [16]

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. 2020. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474--1487

  11. [18]

    Albert Gu and Tri Dao. 2024. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling

  12. [20]

    Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Min Yang, and 1 others. 2025. UNCOMP: Can matrix entropy uncover sparsity? A compressor design from an uncertainty-aware perspective. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4179--4199

  13. [21]

    Alan Baddeley. 2000. The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11):417--423

  14. [22]

    Richard C. Atkinson and Richard M. Shiffrin. 1968. Human memory: A proposed system and its control processes. Psychology of Learning and Motivation, volume 2

  15. [23]

    Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, and Fei Wu. 2025. MadaKV: Adaptive modality-perception KV cache eviction for efficient multimodal long-context inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13306--13318

  16. [24]

    Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. 2025. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5334--5342

  17. [27]

    Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. 2025. PureKV: Plug-and-play KV cache optimization with spatial-temporal sparse attention for vision-language large models. Preprint, arXiv:2510.25600

  18. [28]

    Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. 2025. AccKV: Towards efficient audio-video LLMs inference via adaptive-focusing and cross-calibration KV cache optimization. Preprint, arXiv:2511.11106

  19. [29]

    Kunxi Li, Yufan Xiong, Zhonghua Jiang, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. 2025. FlowMM: Cross-modal information flow guided KV cache merging for efficient multimodal context inference. Preprint, arXiv:2511.05534

  20. [30]

    Alan Baddeley. 2000. The episodic buffer: a new component of working memory? Trends in cognitive sciences, 4(11):417--423

  21. [31]

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2024a. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19--35. Springer

  22. [32]

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and 1 others. 2024b. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330

  23. [33]

    Albert Gu and Tri Dao. 2024. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling

  24. [34]

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. 2020. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474--1487

  25. [35]

    Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396

  26. [36]

    Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, and Babak Taati. 2025. Similarity-aware token pruning: Your vlm but faster. arXiv preprint arXiv:2503.11549

  27. [37]

    Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. 2025a. AccKV: Towards efficient audio-video LLMs inference via adaptive-focusing and cross-calibration KV cache optimization. Preprint, arXiv:2511.11106

  28. [38]

    Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. 2025b. PureKV: Plug-and-play KV cache optimization with spatial-temporal sparse attention for vision-language large models. Preprint, arXiv:2510.25600

  29. [39]

    Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, and Fei Wu. 2025a. MadaKV: Adaptive modality-perception KV cache eviction for efficient multimodal long-context inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13306--13318

  30. [40]

    Kunxi Li, Yufan Xiong, Zhonghua Jiang, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. 2025b. FlowMM: Cross-modal information flow guided KV cache merging for efficient multimodal context inference. Preprint, arXiv:2511.05534

  31. [41]

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947--22970

  32. [42]

    Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. 2025. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5334--5342

  33. [43]

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36:52342--52364

  34. [44]

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2024. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR)

  35. [45]

    Minesh Mathew, Dimosthenis Karatzas, R Manmatha, and CV Jawahar. 2020. DocVQA: A dataset for VQA on document images. arXiv preprint arXiv:2007.00398

  36. [46]

    Jiahe Shi, Zhengqi Gao, Ching-Yun Ko, and Duane Boning. 2025. EARL: Entropy-aware RL alignment of LLMs for reliable RTL code generation. arXiv preprint arXiv:2511.12033

  37. [47]

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326

  38. [48]

    Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, and Benyou Wang. 2024. Milebench: Benchmarking mllms in long context. arXiv preprint arXiv:2404.18532

  39. [49]

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. 2023. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621

  40. [50]

    Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, and Mi Zhang. 2025. Meda: Dynamic kv cache allocation for efficient multimodal long-context inference. arXiv preprint arXiv:2502.17599

  41. [51]

    Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. 2024. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. arXiv preprint arXiv:2406.18139

  42. [52]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453

  43. [53]

    Jing Xiong, Jianghan Shen, Fanghua Ye, Chaofan Tao, Zhongwei Wan, Jianqiao Lu, Xun Wu, Chuanyang Zheng, Zhijiang Guo, Min Yang, and 1 others. 2025. Uncomp: Can matrix entropy uncover sparsity?—a compressor design from an uncertainty-aware perspective. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4179--4199

  44. [54]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, and 1 others. 2024. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556--9567

  45. [55]

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, and 1 others. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661--34710