Decouple and Cache: KV Cache Construction for Streaming Video Understanding
Pith reviewed 2026-05-10 14:44 UTC · model grok-4.3
The pith
Decoupled cache construction lets pretrained video models handle unbounded streams without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DSCache maintains a cumulative past KV cache while constructing a separate instant cache on demand, decoupled from the past cache so that recent inputs keep their informativeness. A position-agnostic encoding strategy lets the KV caches cover unseen positions without position overflow, allowing pretrained offline VideoVLLMs to be adapted to streaming settings without fine-tuning.
What carries the argument
Decoupled Streaming Cache (DSCache), which separates an on-demand instant KV cache for recent inputs from the cumulative past cache and adds position-agnostic encoding for extrapolation.
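To make the decoupling concrete, here is a minimal single-head NumPy sketch. Everything in it is illustrative: the class and method names are hypothetical, random matrices stand in for pretrained weights, and positional encoding is omitted (a position-agnostic variant is sketched separately in the referee discussion below). The point is only the data flow: the newest chunk's KV entries are computed from the chunk alone, then appended to the cumulative past cache that queries attend over.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

class ToyLayer:
    """One single-head attention layer; random weights stand in for pretrained ones."""
    def __init__(self, d, rng):
        self.wq, self.wk, self.wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    def forward(self, x, ctx_k, ctx_v):
        """Hidden states for tokens x, attending over the given context KV plus themselves."""
        k, v = x @ self.wk, x @ self.wv
        return attend(x @ self.wq, np.concatenate([ctx_k, k]), np.concatenate([ctx_v, v]))

class DecoupledStreamingCache:
    """Cumulative past KV cache plus an instant cache built only from the newest chunk."""
    def __init__(self, layer, d):
        self.layer = layer
        self.past_k = np.empty((0, d))
        self.past_v = np.empty((0, d))

    def ingest(self, chunk):
        # Instant cache: hidden states (and hence KV) of the new chunk are computed
        # without conditioning on the long past, so recent inputs are not diluted.
        no_ctx = np.empty((0, chunk.shape[1]))
        h = self.layer.forward(chunk, no_ctx, no_ctx)
        self.past_k = np.concatenate([self.past_k, h @ self.layer.wk])
        self.past_v = np.concatenate([self.past_v, h @ self.layer.wv])

    def answer(self, query_tokens):
        # A question still attends over the whole cumulative cache.
        return attend(query_tokens @ self.layer.wq, self.past_k, self.past_v)

rng = np.random.default_rng(0)
cache = DecoupledStreamingCache(ToyLayer(64, rng), 64)
for _ in range(5):                                        # five incoming chunks of 16 "tokens"
    cache.ingest(rng.standard_normal((16, 64)))
print(cache.answer(rng.standard_normal((1, 64))).shape)   # (1, 64)
```

A conventional streaming construction would instead pass `ctx_k=self.past_k, ctx_v=self.past_v` inside `ingest`; the paper's claim is that avoiding exactly that coupling keeps recent inputs informative.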
If this is right
- Pretrained offline models can be used directly for streaming video tasks without additional training on long sequences.
- Recent inputs retain their original informativeness instead of being diluted in a single growing cache.
- Position overflow is avoided, allowing continuous processing of arbitrarily long video streams.
- State-of-the-art accuracy is reached on Streaming Video QA benchmarks with an average 2.5% gain over prior methods.
Where Pith is reading between the lines
- The same decoupling pattern could reduce memory pressure in other long-sequence tasks such as audio or text streams.
- Combining the instant cache with selective eviction rules might further lower compute costs for real-time applications (one possible rule is sketched after this list).
- The position-agnostic encoding might transfer to other transformer-based models facing length extrapolation limits.
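The eviction idea in the second bullet could look something like the following: a hypothetical policy in the spirit of attention-sink plus sliding-window schemes from the streaming-LLM literature, layered on top of the cumulative cache. It is a speculative extension, not part of DSCache as described.

```python
import numpy as np

def evict(past_k, past_v, n_sink=4, window=512):
    """Keep a few 'sink' entries from the start of the stream plus a recent window,
    dropping the middle. Hypothetical rule; the thresholds are illustrative."""
    n = past_k.shape[0]
    if n <= n_sink + window:
        return past_k, past_v
    keep = np.r_[np.arange(n_sink), np.arange(n - window, n)]
    return past_k[keep], past_v[keep]
```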
Load-bearing premise
That a position-agnostic encoding strategy can be added to pretrained models without fine-tuning and will reliably support arbitrary unseen positions while the decoupled instant cache preserves informativeness.
What would settle it
Run DSCache and a standard KV cache baseline on a video stream exceeding the model's training length, checking whether accuracy holds steady or drops sharply due to position overflow.
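A bare-bones outline of that test, with hypothetical function and object names (`build_cache`, `qa_benchmark`) and an illustrative training-context length:

```python
TRAIN_LEN = 32_768  # assumed training context of the backbone, in tokens (illustrative)

def length_sweep(model, qa_benchmark, build_cache, multipliers=(0.5, 1.0, 2.0, 4.0)):
    """QA accuracy as the stream length steps past the backbone's training length."""
    results = {}
    for m in multipliers:
        n_tokens = int(m * TRAIN_LEN)
        cache = build_cache(model)   # DSCache or a vanilla growing KV cache
        results[n_tokens] = qa_benchmark.evaluate(model, cache, max_stream_tokens=n_tokens)
    return results
```

If the position-agnostic encoding works as claimed, the DSCache curve should stay roughly flat beyond `TRAIN_LEN`, while the vanilla cache should degrade once position ids leave the training range.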
Original abstract
Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, continuously constructing new and evicting old key-value (KV) caches is required for unbounded streams. Second, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on-demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring KV caches support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average 2.5% accuracy gain over prior methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Decoupled Streaming Cache (DSCache), a training-free mechanism to adapt pretrained offline VideoVLLMs to unbounded streaming video understanding. It maintains a cumulative past KV cache while constructing a separate, decoupled instant cache on-demand to preserve recent-input informativeness, and adds a position-agnostic encoding strategy to support extrapolation to unseen positions beyond training lengths and avoid position overflow, claiming SOTA results with an average 2.5% accuracy gain over prior methods on Streaming Video QA benchmarks.
Significance. If the claims hold, the work would be significant for enabling efficient, memory-bounded streaming multimodal inference by reusing existing pretrained weights without fine-tuning or additional parameters. The training-free nature and explicit decoupling of caches represent a practical strength for generalizing short-sequence models to long streams, with potential impact on real-time video QA and analysis applications.
major comments (2)
- [Method (position-agnostic encoding subsection)] The position-agnostic encoding is load-bearing for the central generalization claim (from short training sequences to unbounded streams), yet the manuscript provides no derivation, mathematical invariance proof, or compatibility analysis showing that the modified KV vectors remain compatible with the backbone's attention (e.g., RoPE or absolute embeddings) for arbitrary unseen positions. Without this, cross-cache attention between past and instant caches may degrade.
- [Experiments] The experimental claims of SOTA performance and 2.5% average gains are not supported by sufficient detail: the manuscript lacks explicit baselines, ablation studies isolating the decoupled instant cache versus cumulative cache and the encoding component, implementation specifics for on-demand cache construction, or statistical validation, undermining verification of the performance assertions.
minor comments (1)
- [Abstract] The abstract and introduction could more clearly distinguish the contributions of cache decoupling from the position-agnostic encoding to help readers assess their individual impacts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our submission. The comments identify key areas where additional rigor and detail would strengthen the manuscript. We address each point below and will revise the paper to incorporate the suggested improvements.
Point-by-point responses
Referee: [Method (position-agnostic encoding subsection)] The position-agnostic encoding is load-bearing for the central generalization claim (from short training sequences to unbounded streams), yet the manuscript provides no derivation, mathematical invariance proof, or compatibility analysis showing that the modified KV vectors remain compatible with the backbone's attention (e.g., RoPE or absolute embeddings) for arbitrary unseen positions. Without this, cross-cache attention between past and instant caches may degrade.
Authors: We agree that a formal derivation and compatibility analysis are absent from the current manuscript and would strengthen the generalization claims. In the revision we will expand the position-agnostic encoding subsection with (1) a step-by-step explanation of how the encoding removes absolute position dependence while preserving relative positional information, (2) a compatibility argument for RoPE-based attention showing that the modified keys and values produce equivalent attention scores for any unseen position, and (3) an empirical study of cross-cache attention weights on extrapolated sequences to confirm no degradation occurs. These additions will be supported by both analytical reasoning and additional figures. revision: yes
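One standard way such a compatibility argument can be made concrete, sketched here under the clipped-relative-position idea familiar from LM-Infinite-style methods (the paper's exact construction may differ): cache the keys unrotated and assign rotary angles at attention time from query-relative distances clipped to the training range, so no effective position ever exceeds what the backbone saw in training.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """GPT-NeoX-style rotary embedding for x of shape (n, d), d even, at positions pos (n,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.asarray(pos, dtype=float)[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def position_agnostic_scores(q_raw, k_cache_raw, max_train_pos=4096):
    """q_raw: (1, d) unrotated current query; k_cache_raw: (n, d) unrotated cached keys.
    Rotary angles come from clipped query-relative distances, so they never overflow
    the range seen in training no matter how long the stream has run."""
    n, d = k_cache_raw.shape
    dist = np.minimum(np.arange(n, 0, -1), max_train_pos - 1)  # oldest key farthest, clipped
    q_rot = rope(q_raw, np.zeros(len(q_raw)))                  # query anchored at offset 0
    k_rot = rope(k_cache_raw, -dist)                           # keys at negative clipped offsets
    return q_rot @ k_rot.T / np.sqrt(d)
```

Because rotary scores depend only on the angle difference, each score here equals what the pretrained attention would compute for a query-key pair separated by the clipped distance; this is the kind of invariance the requested analysis would need to establish for DSCache's own encoding.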
Referee: [Experiments] The experimental claims of SOTA performance and 2.5% average gains are not supported by sufficient detail: the manuscript lacks explicit baselines, ablation studies isolating the decoupled instant cache versus cumulative cache and the encoding component, implementation specifics for on-demand cache construction, or statistical validation, undermining verification of the performance assertions.
Authors: We concur that the experimental section requires more granular reporting. The revised manuscript will add: (i) a complete baseline table listing all compared methods with their exact configurations and reported metrics, (ii) ablation tables that separately disable the decoupled instant cache and the position-agnostic encoding to quantify each component's contribution, (iii) pseudocode and hyper-parameter details for the on-demand cache construction procedure, and (iv) statistical validation (means and standard deviations across three random seeds) for the reported accuracy gains. These changes will allow readers to reproduce and verify the 2.5% average improvement. revision: yes
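A small sketch of how the promised ablation and seed-variance reporting might be organized; the configuration flags and `run_fn` are hypothetical stand-ins for the evaluation pipeline.

```python
import statistics

ABLATIONS = {
    "full":            dict(decoupled_instant_cache=True,  position_agnostic=True),
    "no_decoupling":   dict(decoupled_instant_cache=False, position_agnostic=True),
    "no_pos_agnostic": dict(decoupled_instant_cache=True,  position_agnostic=False),
}

def report(run_fn, seeds=(0, 1, 2)):
    """run_fn(seed=..., **config) -> accuracy; prints mean and std over seeds per setting."""
    for name, cfg in ABLATIONS.items():
        accs = [run_fn(seed=s, **cfg) for s in seeds]
        print(f"{name}: {statistics.mean(accs):.2f} ± {statistics.stdev(accs):.2f}")
```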
Circularity Check
No circularity: training-free adaptation reuses pretrained weights without self-referential reductions
full rationale
The paper proposes DSCache as an explicit training-free mechanism that decouples instant KV cache construction from cumulative past caches and adds a position-agnostic encoding to support extrapolation in pretrained VideoVLLMs. No equations, fitted parameters, or predictions are shown to reduce by construction to the input data or prior self-citations; the central claims rest on the architectural description and empirical benchmark gains rather than any self-definitional or load-bearing self-referential step. The derivation chain is therefore grounded in external pretrained models and streaming video QA data rather than in the paper's own outputs.
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Chatterjee, D., Remelli, E., Song, Y., Tekin, B., Mittal, A., Bhatnagar, B., Camgöz, N. C., Hampali, S., Sauser, E., Ma, S., et al. Memory-efficient streaming VideoLLMs for real-time procedural video understanding. arXiv preprint arXiv:2504.13915, 2025.
- [3] Chen, S., Wong, S., Chen, L., and Tian, Y. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- [4] Fu, S., Yang, Q., Li, Y.-M., Peng, Y.-X., Lin, K.-Y., Wei, X., Hu, J.-F., Xie, X., and Zheng, W.-S. ViSpeak: Visual instruction feedback in streaming videos. arXiv preprint arXiv:2503.12769, 2025.
- [5] Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. LM-Infinite: Zero-shot extreme length generalization for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3991–4008, 2024.
- [6] Kim, M., Shim, K., Choi, J., and Chang, S. InfiniPot-V: Memory-constrained KV cache compression for streaming video understanding. arXiv preprint arXiv:2506.15745, 2025.
- [7] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [8] Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.
- [9] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
- [10] Ning, Z., Liu, G., Jin, Q., Ding, W., Guo, M., and Zhao, J. LiveVLM: Efficient online video understanding via streaming-oriented KV cache and retrieval. arXiv preprint arXiv:2505.15269, 2025.
- [11] Pang, Z., Chatterjee, D., Sener, F., and Yao, A. On discriminative vs. generative classifiers: Rethinking MLLMs for action understanding. arXiv preprint arXiv:2603.02546.
- [12] Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
- [13] Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- [14] Wang, H., Feng, B., Lai, Z., Xu, M., Li, S., Ge, W., Dehghan, A., Cao, M., and Huang, P. StreamBridge: Turning your offline video large language model into a proactive streaming assistant. arXiv preprint arXiv:2505.05467, 2025.
- [15] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
- [16] Xu, R., Xiao, G., Chen, Y., He, L., Peng, K., Lu, Y., and Han, S. StreamingVLM: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025.
- [17] Yang, Y., Zhao, Z., Shukla, S. N., Singh, A., Mishra, S. K., Zhang, L., and Ren, M. StreamMem: Query-agnostic KV cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717, 2025.
- [18] Zhang, H., Wang, Y., Tang, Y., Liu, Y., Feng, J., Dai, J., and Jin, X. Flash-VStream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.