{"total":11,"items":[{"citing_arxiv_id":"2605.22269","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering","primary_cat":"cs.CV","submitted_at":"2026-05-21T10:13:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17921","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Efficient Streaming Video Understanding Framework with Agentic Control","primary_cat":"cs.CV","submitted_at":"2026-05-18T06:29:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14310","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T03:22:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07897","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic-Aware Adaptive Visual Memory for Streaming Video 
Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:40:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Understanding video streams in real time requires sequential processing of incoming frames [10, 15, 22, 24], and existing approaches fall into two broad categories. One line of work manages KV caches during inference. ReKV [6] retrieves query-relevant entries from stored caches, LiveVLM [21] separates short- and long-term memory with online retrieval, and StreamMem [39] maintains bounded memory through continuous compression. Another line of work compresses visual tokens upstream of the LLM. ToMe [2] merges similar tokens, and FluxMem [37] introduces a hierarchical memory governed by adaptive thresholds [23]. 
Beyond streaming video, a parallel line of work in long-context language modeling explores gated and utility-aware memory consolidation, suggesting that selectively"},{"citing_arxiv_id":"2605.01858","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decouple and Cache: KV Cache Construction for Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-03T13:02:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DSCache decouples cumulative past and instant KV caches with position-agnostic encoding to adapt offline VideoVLLMs to streaming video, delivering 2.5% average accuracy gains on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24317","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Don't Pause! Every prediction matters in a streaming video","primary_cat":"cs.CV","submitted_at":"2026-04-27T11:07:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compute [27, 53, 6, 18, 70, 11], they have since moved toward real-time understanding. [12] initiated this line of work, where a system monitors a live video and responds when needed while staying silent otherwise. Streaming VideoLLMs incrementally process frames at a predefined rate and aim to respond at critical moments [12, 63, 49, 71], sometimes proactively [49, 71, 56, 60] and target low latency [71, 11, 67]. 
Despite this progress in processing streaming inputs, there has been limited research on handling streaming outputs and their evaluation. As a result, delayed, missed, or over-responses go unpenalized, making them impractical for always-on, real-time assistants. 3 Preliminaries: Online Video Question-Answering We formally introduce online video question-answering and define two paradigms:"},{"citing_arxiv_id":"2604.10060","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mosaic: Cross-Modal Clustering for Efficient Video Understanding","primary_cat":"cs.PF","submitted_at":"2026-04-11T06:54:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"form A consists of a single NVIDIA H800 GPU (80GB) with an Intel Platinum 8468V CPU. Platform B consists of eight NVIDIA A40 GPUs (48GB each) with an Intel Silver 4314 CPU. Most experiments are conducted on Platform A, while the scalability experiments are conducted on Platform B. Models. We evaluate three vision-language models: LLaVA-OneVision-7B [17], Qwen2.5-VL-7B [18], and Qwen2.5-VL-3B. These models vary in both architecture and parameter scale, providing strong baselines for video understanding. Dataset. We evaluate our method on five public video benchmarks, as listed in Table II. These benchmarks cover complementary scenarios and video lengths of up to two hours. 
To further assess effectiveness in real-world deployment, we"},{"citing_arxiv_id":"2604.06036","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference","primary_cat":"cs.DC","submitted_at":"2026-04-07T16:31:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baselines with 0-8% F1 drop.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In ACL (1). 11608-11620. [83] Muchao Ye, Weiyang Liu, and Pan He. 2025. VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR). 8679-8688. [84] Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. 2025. Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI). 22128-22136. [85] Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Chen Chen, Jian Jin, and Zhenzhen Jiao. 2024. 
Towards Surveillance Video-and-"},{"citing_arxiv_id":"2603.20284","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-03-18T06:36:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14724","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-01-21T07:26:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15269","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval","primary_cat":"cs.CV","submitted_at":"2025-05-21T08:47:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling 
LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}