pith. machine review for the scientific record.

arxiv: 2605.07897 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

Hang Wu, Ming-Hsuan Yang, Sherin Mary Mathews, Yiwei Wang, Yujun Cai

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords streaming video understanding · visual memory management · semantic-aware compression · query-adaptive retrieval · training-free framework · vision-language models · real-time video processing

The pith

Semantic signals from a fixed question bank let models retain relevant frames in streaming videos and retrieve them adaptively, improving accuracy while nearly halving peak memory use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAVEMem, a training-free dual-stage framework that manages memory for vision-language models handling continuous video streams and real-time queries. Existing approaches rely on visual similarity for compression or add retrieval after the fact, but SAVEMem incorporates semantic priors early to decide what to keep long-term and adapts retrieval scope per query. A fixed pseudo-question bank shapes retention in a three-tier memory under a fixed budget. An anchor-conditioned recency gate then expands or contracts the retrieval window from short-term to long-term memory depending on the query's temporal target. Late interaction within that window selects frames for the answer. When added to Qwen2.5-VL, the method raises OVO-Bench scores from 52.27 to 62.69 while cutting peak GPU memory by 48 percent at 128 frames.
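
To make the Stage 1 mechanism concrete, the sketch below shows one way such a three-tier store could operate under a constant budget. It assumes frame and pseudo-question embeddings are already available from the backbone's encoders; the class name, tier capacities, and subsampling rule are illustrative placeholders, not the authors' implementation.

```python
import numpy as np
from collections import deque

class ThreeTierMemory:
    """Toy three-tier store: short-term (recent frames), mid-term (subsampled
    history), long-term (bounded set of semantically salient frames)."""

    def __init__(self, question_bank_embs, short_cap=16, mid_cap=32, long_cap=80):
        self.q_bank = np.stack(question_bank_embs)   # fixed pseudo-question embeddings (L2-normalized)
        self.short = deque(maxlen=short_cap)         # most recent frames, FIFO
        self.mid = deque(maxlen=mid_cap)             # coarsely subsampled recent history
        self.long = []                               # salience-ranked keepers, bounded by long_cap
        self.long_cap = long_cap

    def _salience(self, frame_emb):
        # Semantic prior: score a frame by its best match against the fixed
        # pseudo-question bank, not by similarity to neighbouring frames.
        return float(np.max(self.q_bank @ frame_emb))

    def ingest(self, t, frame_emb):
        # Frames evicted from the short-term tier cascade into mid- and long-term.
        evicted = self.short[0] if len(self.short) == self.short.maxlen else None
        self.short.append((t, frame_emb))
        if evicted is None:
            return
        et, e_emb = evicted
        if et % 4 == 0:                              # placeholder subsampling rule
            self.mid.append(evicted)
        self.long.append((self._salience(e_emb), et, e_emb))
        self.long.sort(key=lambda x: -x[0])          # keep the most salient frames
        self.long = self.long[: self.long_cap]       # constant long-term budget
```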

Core claim

SAVEMem builds a three-tier streaming memory online under a constant budget, in which a fixed pseudo-question bank supplies semantic salience to guide long-term retention rather than visual similarity alone; a second stage then applies query-aware retrieval via an anchor-conditioned recency gate that adapts scope across short-, mid-, and long-term tiers before late interaction selects candidate frames.

What carries the argument

Dual-stage SAVEMem: semantic prior from fixed pseudo-question bank for three-tier memory generation, plus anchor-conditioned recency gate for query-adaptive retrieval scope.

If this is right

  • Applied without training, SAVEMem raises OVO-Bench overall score from 52.27 to 62.69 on Qwen2.5-VL.
  • Consistent gains appear on StreamingBench and ODV-Bench under the same zero-training setting.
  • Peak GPU memory at 128 frames falls by 48 percent relative to the unmodified backbone.
  • The three-tier memory plus adaptive retrieval coordinates compression and retrieval in one pipeline rather than treating them separately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same semantic-prior approach could be tested on other vision-language backbones to check whether the gains transfer without retraining.
  • Varying the size or content of the pseudo-question bank might reveal how much semantic coverage is needed for different video domains.
  • Because retrieval scope now adapts to query timing, the method may reduce unnecessary token loading in very long streams where most queries target recent frames.

Load-bearing premise

A fixed pseudo-question bank supplies a lightweight yet effective semantic prior that shapes long-term retention decisions better than visual similarity alone.
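
The contrast this premise draws can be made concrete with two toy scoring functions, one driven only by inter-frame similarity and one driven by the fixed pseudo-question bank. Both are illustrative stand-ins that assume L2-normalized embeddings, not the paper's exact formulas.

```python
import numpy as np

def visual_novelty_score(frame_emb, prev_frame_emb):
    # Visual-similarity heuristic: keep a frame only if it looks different
    # from its predecessor, regardless of what a user might later ask.
    return 1.0 - float(prev_frame_emb @ frame_emb)

def semantic_salience_score(frame_emb, question_bank_embs):
    # Semantic prior: keep a frame if it plausibly answers at least one
    # generic pseudo-question, even when it resembles its neighbours.
    return float(np.max(question_bank_embs @ frame_emb))
```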

What would settle it

Removing the pseudo-question bank from Stage 1 or the recency gate from Stage 2 and measuring whether OVO-Bench, StreamingBench, and ODV-Bench scores drop back to or below the Qwen2.5-VL baseline at the same memory budget.
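
A minimal harness for that test might look as follows, reusing the two toy scoring functions sketched above; the stream loader and evaluate() hook are hypothetical placeholders for whatever evaluation pipeline is actually used, not part of the paper.

```python
def run_retention_ablation(stream_loader, evaluate, question_bank_embs):
    # Swap only the long-term retention signal; hold the three-tier structure,
    # memory budget, and Stage 2 retrieval fixed across both runs.
    scorers = {
        "semantic_prior": lambda f, prev: semantic_salience_score(f, question_bank_embs),
        "visual_only": lambda f, prev: visual_novelty_score(f, prev),
    }
    results = {}
    for name, scorer in scorers.items():
        results[name] = evaluate(stream_loader, retention_score=scorer)
    # Compare OVO-Bench / StreamingBench / ODV-Bench deltas against the
    # unmodified Qwen2.5-VL baseline at the same memory budget.
    return results
```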

read the original abstract

Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
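
The Stage 2 retrieval path described in the abstract can be sketched as follows. The gate condition, memory layout, and top_k value are assumptions made for illustration; the MaxSim scoring follows the ColBERT-style late interaction of reference [12] rather than any formula quoted from the paper.

```python
import numpy as np

def retrieve_frames(query_tokens, short_mem, mid_mem, long_mem,
                    query_targets_past, top_k=8):
    """query_tokens: (Q, D) array; each memory entry is (timestamp, (F, D) token array)."""
    # Anchor-conditioned recency gate: stay in short-term memory for
    # present-oriented queries, widen to mid- and long-term for past ones.
    scope = list(short_mem)
    if query_targets_past:
        scope += list(mid_mem) + list(long_mem)

    scored = []
    for t, frame_tokens in scope:
        # Late interaction (MaxSim): every query token picks its best-matching
        # memory token; the frame score is the sum of those maxima.
        sim = query_tokens @ frame_tokens.T          # (Q, F) similarity matrix
        scored.append((float(sim.max(axis=1).sum()), t))
    scored.sort(reverse=True)
    return [t for _, t in scored[:top_k]]            # frames passed to the answerer
```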

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SAVEMem, a training-free dual-stage framework for streaming video understanding. In Stage 1, it builds a three-tier streaming memory online under constant budget, using a fixed pseudo-question bank to shape long-term retention by semantic salience rather than visual similarity alone. In Stage 2, it performs query-aware retrieval with an anchor-conditioned recency gate that adapts scope from short- to long-term memory, followed by late interaction for frame selection. Applied to Qwen2.5-VL, it reports OVO-Bench improvement from 52.27 to 62.69, consistent gains on StreamingBench and ODV-Bench, and 48% peak GPU memory reduction at 128 frames.

Significance. If the results hold under proper controls, this would represent a meaningful advance in efficient, training-free memory management for vision-language models handling unbounded video streams. The training-free application to an existing backbone, the explicit constant-budget constraint, and the reported memory savings are clear strengths. Consistent cross-benchmark gains suggest practical relevance for real-time streaming tasks.

major comments (3)
  1. [Abstract / Stage 1] Abstract and Stage 1 description: The central claim attributes the +10.42 OVO-Bench gain (and consistent lifts elsewhere) to the semantic prior from the fixed pseudo-question bank outperforming visual-similarity heuristics. However, no ablation is described that swaps only this bank for a pure visual baseline while freezing the three-tier structure, constant budget, and Stage 2 components. Without this isolation, the semantic-awareness contribution cannot be confirmed as the causal driver.
  2. [Abstract / Experiments] Abstract and Experiments section: The headline numeric improvements are presented without details on experimental controls, baseline re-implementations, statistical significance, variance across runs, or checks for data-selection effects. This leaves the support for the reported gains provisional and makes it difficult to assess whether the gains generalize beyond the specific evaluation setup.
  3. [Stage 1] Stage 1 description: No construction details are supplied for the pseudo-question bank (size, source, curation process, or cross-benchmark generality). If the bank is derived from or tuned to the evaluated benchmarks (OVO-Bench, StreamingBench, ODV-Bench), the semantic prior risks circularity and the claim that it supplies an independent lightweight semantic signal would be weakened.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'consistent gains' is used without reporting the per-benchmark deltas or absolute scores; adding these numbers would improve precision and allow readers to gauge effect sizes directly.
  2. [Abstract] Notation and terminology: The terms 'three-tier streaming memory' and 'anchor-conditioned recency gate' are introduced without a brief inline gloss in the abstract; a short parenthetical definition on first use would aid accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the current manuscript and committing to specific revisions that strengthen the presentation of SAVEMem without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract / Stage 1] Abstract and Stage 1 description: The central claim attributes the +10.42 OVO-Bench gain (and consistent lifts elsewhere) to the semantic prior from the fixed pseudo-question bank outperforming visual-similarity heuristics. However, no ablation is described that swaps only this bank for a pure visual baseline while freezing the three-tier structure, constant budget, and Stage 2 components. Without this isolation, the semantic-awareness contribution cannot be confirmed as the causal driver.

    Authors: We agree that a controlled ablation replacing only the pseudo-question bank with a pure visual-similarity retention policy (while freezing the three-tier memory structure, constant budget, and all Stage 2 components) would provide the cleanest isolation of the semantic prior's contribution. The current manuscript demonstrates gains over existing visual-heuristic methods but does not include this exact swap. We will add this ablation in the revised version to directly support the causal attribution. revision: yes

  2. Referee: [Abstract / Experiments] Abstract and Experiments section: The headline numeric improvements are presented without details on experimental controls, baseline re-implementations, statistical significance, variance across runs, or checks for data-selection effects. This leaves the support for the reported gains provisional and makes it difficult to assess whether the gains generalize beyond the specific evaluation setup.

    Authors: We acknowledge the need for greater transparency on experimental rigor. The reported results follow the standard protocols of OVO-Bench, StreamingBench, and ODV-Bench, and the method is fully deterministic given fixed inputs and the fixed pseudo-question bank. In the revision we will expand the Experiments section with explicit details on baseline re-implementations, any statistical significance checks performed, run-to-run variance (expected to be zero), and explicit checks confirming that gains are not driven by data-selection artifacts. revision: yes

  3. Referee: [Stage 1] Stage 1 description: No construction details are supplied for the pseudo-question bank (size, source, curation process, or cross-benchmark generality). If the bank is derived from or tuned to the evaluated benchmarks (OVO-Bench, StreamingBench, ODV-Bench), the semantic prior risks circularity and the claim that it supplies an independent lightweight semantic signal would be weakened.

    Authors: The pseudo-question bank is a fixed, benchmark-agnostic collection of generic questions intended to capture common semantic aspects of video streams. We will add a new subsection in the revised Stage 1 description that fully specifies its size, source, curation process, and evidence of generality across benchmarks, thereby confirming that it functions as an independent lightweight semantic prior and eliminating any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes a training-free dual-stage framework applied directly to the pre-existing Qwen2.5-VL backbone. Reported gains on OVO-Bench, StreamingBench, and ODV-Bench are presented as empirical outcomes of the method rather than quantities derived by construction from fitted parameters or self-referential definitions. No equations, self-citations, or procedures in the abstract or described pipeline reduce the central claims (semantic prior via fixed pseudo-question bank, adaptive retrieval) to inputs by definition. The method is explicitly positioned as not requiring training or benchmark-specific fitting, satisfying the criteria for a self-contained, non-circular presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the assumption that a fixed pseudo-question bank can serve as an effective semantic prior and that the recency gate can reliably adapt retrieval scope; these are methodological choices rather than fitted parameters or new physical entities.

axioms (1)
  • domain assumption A fixed pseudo-question bank provides a sufficient semantic prior for shaping long-term memory retention
    Invoked in the description of Stage 1 memory construction
invented entities (2)
  • three-tier streaming memory no independent evidence
    purpose: Organize visual tokens under constant budget with semantic salience guiding retention
    Core component introduced in Stage 1
  • anchor-conditioned recency gate no independent evidence
    purpose: Dynamically adjust retrieval scope from short-term to long-term memory based on query timing
    Core component introduced in Stage 2

pith-pipeline@v0.9.0 · 5569 in / 1430 out tokens · 49190 ms · 2026-05-11T02:09:00.735347+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 12 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  2. [2]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. InICLR, 2023

  3. [3]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024

  4. [4]

    End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  5. [5]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

  6. [6]

    Streaming video question-answering with in-context video kv-cache retrieval

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. InICLR, 2025

  7. [7]

    Contextnav: Towards agentic multimodal in-context learning.arXiv preprint arXiv:2510.04560, 2025

    Honghao Fu, Yuan Ouyang, Kai-Wei Chang, Yiwei Wang, Zi Huang, and Yujun Cai. Contextnav: Towards agentic multimodal in-context learning.arXiv preprint arXiv:2510.04560, 2025

  8. [8]

    Vispeak: Visual instruction feedback in streaming videos.ICCV, 2025

    Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, and Wei-Shi Zheng. Vispeak: Visual instruction feedback in streaming videos.ICCV, 2025

  9. [9]

    Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

    Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

  10. [10]

    Online video understanding: Ovbench and videochat-online

    Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InCVPR, pages 3328–3338, 2025

  11. [11]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv:2410.21276, 2024

  12. [12]

    Colbert: Efficient and effective passage search via contextualized late interaction over bert

    Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020

  13. [13]

    Interaction methods for smart glasses: A survey.IEEE access, 6:28712–28732, 2018

    Lik-Hang Lee and Pan Hui. Interaction methods for smart glasses: A survey.IEEE access, 6:28712–28732, 2018

  14. [14]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  15. [15]

    StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024

  16. [16]

    Aligning cyber space with physical world: A comprehensive survey on embodied ai.IEEE/ASME Transactions on Mechatronics, 2025

    Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai.IEEE/ASME Transactions on Mechatronics, 2025

  17. [17]

    Thinking in streaming video.arXiv preprint arXiv:2603.12938, 2026

    Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, and Jing Liu. Thinking in streaming video.arXiv preprint arXiv:2603.12938, 2026

  18. [18]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, 2024

  19. [19]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large language models.arXiv preprint arXiv:2507.13334, 2025

  20. [20]

    Gated differentiable working memory for long-context language modeling.arXiv preprint arXiv:2601.12906, 2026

    Lingrui Mei, Shenghua Liu, Yiwei Wang, Yuyao Ge, Baolong Bi, Jiayu Yao, Jun Wan, Ziling Yin, Jiafeng Guo, and Xueqi Cheng. Gated differentiable working memory for long-context language modeling.arXiv preprint arXiv:2601.12906, 2026

  21. [21]

    LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

    Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269, 2025

  22. [22]

    Ovo-bench: How far is your video-llms from real-world online video understanding? In CVPR, pages 18902–18913, 2025

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In CVPR, pages 18902–18913, 2025

  23. [23]

    A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979

    Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979

  24. [24]

    Streaming long video understanding with large language models.NeurIPS, 37:119336–119360, 2024

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.NeurIPS, 37:119336–119360, 2024

  25. [25]

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InCVPR, 2025

  26. [26]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. In ICML, 2025

  27. [27]

    A Simple Baseline for Streaming Video Understanding

    Yujiao Shen, Shulin Tian, Jingkang Yang, and Ziwei Liu. A simple baseline for streaming video understanding.arXiv preprint arXiv:2604.02317, 2026

  28. [28]

    Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

  29. [29]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024

  30. [30]

    Streambridge: Turning your offline video large language model into a proactive streaming assistant.NeurIPS, 2025

    Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. Streambridge: Turning your offline video large language model into a proactive streaming assistant.NeurIPS, 2025

  31. [31]

    Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

    Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

  32. [32]

    To see is to believe: Prompting gpt-4v for better visual instruction tuning

    Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning.arXiv preprint arXiv:2311.07574, 2023

  33. [33]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  34. [34]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2.5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025

  35. [35]

    CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, and Yiwei Wang. Camreasoner: Reinforcing camera movement understanding via structured spatial reasoning.arXiv preprint arXiv:2602.00181, 2026

  36. [36]

    Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024

  37. [37]

    Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026

    Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026

  38. [38]

    StreamingVLM: Real-time Understanding for Infinite Video Streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025

  39. [39]

    Streammem: Query-agnostic kv cache memory for streaming video understanding.arXiv preprint arXiv:2508.15717, 2025

    Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding.arXiv preprint arXiv:2508.15717, 2025

  40. [40]

    Timechat-online: 80% visual tokens are naturally redundant in streaming videos

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. InACM MM, 2025

  41. [41]

    Streamforest: Efficient online video understanding with persistent event memory.NeurIPS, 2025

    Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, et al. Streamforest: Efficient online video understanding with persistent event memory. NeurIPS, 2025

  42. [42]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. EMNLP, 2023. URL https://arxiv.org/abs/2306.02858

  43. [43]

    Flash-vstream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. InICCV, 2025

  44. [44]

    HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

    Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, and Xipeng Qiu. Hermes: Kv cache as hierarchical memory for efficient streaming video understanding.arXiv preprint arXiv:2601.14724, 2026

  45. [45]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024

  46. [46]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  47. [47]

    Weavetime: Stream from earlier frames into emergent memory in videollms.arXiv preprint arXiv:2602.22142, 2026

    Yulin Zhang, Cheng Shi, and Sibei Yang. Weavetime: Stream from earlier frames into emergent memory in videollms. arXiv preprint arXiv:2602.22142, 2026
