pith. machine review for the scientific record.

arxiv: 2605.07897 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

Hang Wu, Ming-Hsuan Yang, Sherin Mary Mathews, Yiwei Wang, Yujun Cai

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords streaming video understanding · visual memory management · semantic-aware compression · query-adaptive retrieval · training-free framework · vision-language models · real-time video processing

The pith

Semantic signals from a fixed question bank let models retain relevant frames in streaming videos and retrieve them adaptively, improving accuracy while nearly halving peak memory use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAVEMem, a training-free dual-stage framework that manages memory for vision-language models handling continuous video streams and real-time queries. Existing approaches rely on visual similarity for compression or add retrieval after the fact, but SAVEMem incorporates semantic priors early to decide what to keep long-term and adapts retrieval scope per query. A fixed pseudo-question bank shapes retention in a three-tier memory under a fixed budget. An anchor-conditioned recency gate then expands or contracts the retrieval window from short-term to long-term memory depending on the query's temporal target. Late interaction within that window selects frames for the answer. When added to Qwen2.5-VL, the method raises OVO-Bench scores from 52.27 to 62.69 while cutting peak GPU memory by 48 percent at 128 frames.
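
To make the Stage 1 mechanism concrete, the sketch below shows one way such a three-tier store could operate under a constant budget. It assumes frame and pseudo-question embeddings are already available from the backbone's encoders; the class name, tier capacities, and subsampling rule are illustrative placeholders, not the authors' implementation.

```python
import numpy as np
from collections import deque

class ThreeTierMemory:
    """Toy three-tier store: short-term (recent frames), mid-term (subsampled
    history), long-term (bounded set of semantically salient frames)."""

    def __init__(self, question_bank_embs, short_cap=16, mid_cap=32, long_cap=80):
        self.q_bank = np.stack(question_bank_embs)   # fixed pseudo-question embeddings (L2-normalized)
        self.short = deque(maxlen=short_cap)         # most recent frames, FIFO
        self.mid = deque(maxlen=mid_cap)             # coarsely subsampled recent history
        self.long = []                               # salience-ranked keepers, bounded by long_cap
        self.long_cap = long_cap

    def _salience(self, frame_emb):
        # Semantic prior: score a frame by its best match against the fixed
        # pseudo-question bank, not by similarity to neighbouring frames.
        return float(np.max(self.q_bank @ frame_emb))

    def ingest(self, t, frame_emb):
        # Frames evicted from the short-term tier cascade into mid- and long-term.
        evicted = self.short[0] if len(self.short) == self.short.maxlen else None
        self.short.append((t, frame_emb))
        if evicted is None:
            return
        et, e_emb = evicted
        if et % 4 == 0:                              # placeholder subsampling rule
            self.mid.append(evicted)
        self.long.append((self._salience(e_emb), et, e_emb))
        self.long.sort(key=lambda x: -x[0])          # keep the most salient frames
        self.long = self.long[: self.long_cap]       # constant long-term budget
```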

Core claim

SAVEMem builds a three-tier streaming memory online under a constant budget, in which a fixed pseudo-question bank supplies semantic salience to guide long-term retention rather than visual similarity alone; a second stage then applies query-aware retrieval via an anchor-conditioned recency gate that adapts scope across short-, mid-, and long-term tiers before late interaction selects candidate frames.

What carries the argument

Dual-stage SAVEMem: semantic prior from fixed pseudo-question bank for three-tier memory generation, plus anchor-conditioned recency gate for query-adaptive retrieval scope.

If this is right

  • Applied without training, SAVEMem raises OVO-Bench overall score from 52.27 to 62.69 on Qwen2.5-VL.
  • Consistent gains appear on StreamingBench and ODV-Bench under the same zero-training setting.
  • Peak GPU memory at 128 frames falls by 48 percent relative to the unmodified backbone.
  • The three-tier memory plus adaptive retrieval coordinates compression and retrieval in one pipeline rather than treating them separately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same semantic-prior approach could be tested on other vision-language backbones to check whether the gains transfer without retraining.
  • Varying the size or content of the pseudo-question bank might reveal how much semantic coverage is needed for different video domains.
  • Because retrieval scope now adapts to query timing, the method may reduce unnecessary token loading in very long streams where most queries target recent frames.

Load-bearing premise

A fixed pseudo-question bank supplies a lightweight yet effective semantic prior that shapes long-term retention decisions better than visual similarity alone.
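
The contrast this premise draws can be made concrete with two toy scoring functions, one driven only by inter-frame similarity and one driven by the fixed pseudo-question bank. Both are illustrative stand-ins that assume L2-normalized embeddings, not the paper's exact formulas.

```python
import numpy as np

def visual_novelty_score(frame_emb, prev_frame_emb):
    # Visual-similarity heuristic: keep a frame only if it looks different
    # from its predecessor, regardless of what a user might later ask.
    return 1.0 - float(prev_frame_emb @ frame_emb)

def semantic_salience_score(frame_emb, question_bank_embs):
    # Semantic prior: keep a frame if it plausibly answers at least one
    # generic pseudo-question, even when it resembles its neighbours.
    return float(np.max(question_bank_embs @ frame_emb))
```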

What would settle it

Removing the pseudo-question bank from Stage 1 or the recency gate from Stage 2 and measuring whether OVO-Bench, StreamingBench, and ODV-Bench scores drop back to or below the Qwen2.5-VL baseline at the same memory budget.
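
A minimal harness for that test might look as follows, reusing the two toy scoring functions sketched above; the stream loader and evaluate() hook are hypothetical placeholders for whatever evaluation pipeline is actually used, not part of the paper.

```python
def run_retention_ablation(stream_loader, evaluate, question_bank_embs):
    # Swap only the long-term retention signal; hold the three-tier structure,
    # memory budget, and Stage 2 retrieval fixed across both runs.
    scorers = {
        "semantic_prior": lambda f, prev: semantic_salience_score(f, question_bank_embs),
        "visual_only": lambda f, prev: visual_novelty_score(f, prev),
    }
    results = {}
    for name, scorer in scorers.items():
        results[name] = evaluate(stream_loader, retention_score=scorer)
    # Compare OVO-Bench / StreamingBench / ODV-Bench deltas against the
    # unmodified Qwen2.5-VL baseline at the same memory budget.
    return results
```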

read the original abstract

Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
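
The Stage 2 retrieval path described in the abstract can be sketched as follows. The gate condition, memory layout, and top_k value are assumptions made for illustration; the MaxSim scoring follows the ColBERT-style late interaction of reference [12] rather than any formula quoted from the paper.

```python
import numpy as np

def retrieve_frames(query_tokens, short_mem, mid_mem, long_mem,
                    query_targets_past, top_k=8):
    """query_tokens: (Q, D) array; each memory entry is (timestamp, (F, D) token array)."""
    # Anchor-conditioned recency gate: stay in short-term memory for
    # present-oriented queries, widen to mid- and long-term for past ones.
    scope = list(short_mem)
    if query_targets_past:
        scope += list(mid_mem) + list(long_mem)

    scored = []
    for t, frame_tokens in scope:
        # Late interaction (MaxSim): every query token picks its best-matching
        # memory token; the frame score is the sum of those maxima.
        sim = query_tokens @ frame_tokens.T          # (Q, F) similarity matrix
        scored.append((float(sim.max(axis=1).sum()), t))
    scored.sort(reverse=True)
    return [t for _, t in scored[:top_k]]            # frames passed to the answerer
```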

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SAVEMem, a training-free dual-stage framework for streaming video understanding. In Stage 1, it builds a three-tier streaming memory online under constant budget, using a fixed pseudo-question bank to shape long-term retention by semantic salience rather than visual similarity alone. In Stage 2, it performs query-aware retrieval with an anchor-conditioned recency gate that adapts scope from short- to long-term memory, followed by late interaction for frame selection. Applied to Qwen2.5-VL, it reports OVO-Bench improvement from 52.27 to 62.69, consistent gains on StreamingBench and ODV-Bench, and 48% peak GPU memory reduction at 128 frames.

Significance. If the results hold under proper controls, this would represent a meaningful advance in efficient, training-free memory management for vision-language models handling unbounded video streams. The training-free application to an existing backbone, the explicit constant-budget constraint, and the reported memory savings are clear strengths. Consistent cross-benchmark gains suggest practical relevance for real-time streaming tasks.

major comments (3)
  1. [Abstract / Stage 1] Abstract and Stage 1 description: The central claim attributes the +10.42 OVO-Bench gain (and consistent lifts elsewhere) to the semantic prior from the fixed pseudo-question bank outperforming visual-similarity heuristics. However, no ablation is described that swaps only this bank for a pure visual baseline while freezing the three-tier structure, constant budget, and Stage 2 components. Without this isolation, the semantic-awareness contribution cannot be confirmed as the causal driver.
  2. [Abstract / Experiments] Abstract and Experiments section: The headline numeric improvements are presented without details on experimental controls, baseline re-implementations, statistical significance, variance across runs, or checks for data-selection effects. This leaves the support for the reported gains provisional and makes it difficult to assess whether the gains generalize beyond the specific evaluation setup.
  3. [Stage 1] Stage 1 description: No construction details are supplied for the pseudo-question bank (size, source, curation process, or cross-benchmark generality). If the bank is derived from or tuned to the evaluated benchmarks (OVO-Bench, StreamingBench, ODV-Bench), the semantic prior risks circularity and the claim that it supplies an independent lightweight semantic signal would be weakened.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'consistent gains' is used without reporting the per-benchmark deltas or absolute scores; adding these numbers would improve precision and allow readers to gauge effect sizes directly.
  2. [Abstract] Notation and terminology: The terms 'three-tier streaming memory' and 'anchor-conditioned recency gate' are introduced without a brief inline gloss in the abstract; a short parenthetical definition on first use would aid accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the current manuscript and committing to specific revisions that strengthen the presentation of SAVEMem without altering its core claims.

read point-by-point responses
  1. Referee: [Abstract / Stage 1] Abstract and Stage 1 description: The central claim attributes the +10.42 OVO-Bench gain (and consistent lifts elsewhere) to the semantic prior from the fixed pseudo-question bank outperforming visual-similarity heuristics. However, no ablation is described that swaps only this bank for a pure visual baseline while freezing the three-tier structure, constant budget, and Stage 2 components. Without this isolation, the semantic-awareness contribution cannot be confirmed as the causal driver.

    Authors: We agree that a controlled ablation replacing only the pseudo-question bank with a pure visual-similarity retention policy (while freezing the three-tier memory structure, constant budget, and all Stage 2 components) would provide the cleanest isolation of the semantic prior's contribution. The current manuscript demonstrates gains over existing visual-heuristic methods but does not include this exact swap. We will add this ablation in the revised version to directly support the causal attribution. revision: yes

  2. Referee: [Abstract / Experiments] Abstract and Experiments section: The headline numeric improvements are presented without details on experimental controls, baseline re-implementations, statistical significance, variance across runs, or checks for data-selection effects. This leaves the support for the reported gains provisional and makes it difficult to assess whether the gains generalize beyond the specific evaluation setup.

    Authors: We acknowledge the need for greater transparency on experimental rigor. The reported results follow the standard protocols of OVO-Bench, StreamingBench, and ODV-Bench, and the method is fully deterministic given fixed inputs and the fixed pseudo-question bank. In the revision we will expand the Experiments section with explicit details on baseline re-implementations, any statistical significance checks performed, run-to-run variance (expected to be zero), and explicit checks confirming that gains are not driven by data-selection artifacts. revision: yes

  3. Referee: [Stage 1] Stage 1 description: No construction details are supplied for the pseudo-question bank (size, source, curation process, or cross-benchmark generality). If the bank is derived from or tuned to the evaluated benchmarks (OVO-Bench, StreamingBench, ODV-Bench), the semantic prior risks circularity and the claim that it supplies an independent lightweight semantic signal would be weakened.

    Authors: The pseudo-question bank is a fixed, benchmark-agnostic collection of generic questions intended to capture common semantic aspects of video streams. We will add a new subsection in the revised Stage 1 description that fully specifies its size, source, curation process, and evidence of generality across benchmarks, thereby confirming that it functions as an independent lightweight semantic prior and eliminating any appearance of circularity. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes a training-free dual-stage framework applied directly to the pre-existing Qwen2.5-VL backbone. Reported gains on OVO-Bench, StreamingBench, and ODV-Bench are presented as empirical outcomes of the method rather than quantities derived by construction from fitted parameters or self-referential definitions. No equations, self-citations, or procedures in the abstract or described pipeline reduce the central claims (semantic prior via fixed pseudo-question bank, adaptive retrieval) to inputs by definition. The method is explicitly positioned as not requiring training or benchmark-specific fitting, satisfying the criteria for a self-contained, non-circular presentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the assumption that a fixed pseudo-question bank can serve as an effective semantic prior and that the recency gate can reliably adapt retrieval scope; these are methodological choices rather than fitted parameters or new physical entities.

axioms (1)
  • domain assumption A fixed pseudo-question bank provides a sufficient semantic prior for shaping long-term memory retention
    Invoked in the description of Stage 1 memory construction
invented entities (2)
  • three-tier streaming memory no independent evidence
    purpose: Organize visual tokens under constant budget with semantic salience guiding retention
    Core component introduced in Stage 1
  • anchor-conditioned recency gate no independent evidence
    purpose: Dynamically adjust retrieval scope from short-term to long-term memory based on query timing
    Core component introduced in Stage 2

pith-pipeline@v0.9.0 · 5569 in / 1430 out tokens · 49190 ms · 2026-05-11T02:09:00.735347+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 12 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  2. [2]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. InICLR, 2023

  3. [3]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024

  4. [4]

    End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  5. [5]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

  6. [6]

    Streaming video question-answering with in-context video kv-cache retrieval

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. InICLR, 2025

  7. [7]

    Contextnav: Towards agentic multimodal in-context learning.arXiv preprint arXiv:2510.04560, 2025

    Honghao Fu, Yuan Ouyang, Kai-Wei Chang, Yiwei Wang, Zi Huang, and Yujun Cai. Contextnav: Towards agentic multimodal in-context learning.arXiv preprint arXiv:2510.04560, 2025

  8. [8]

    Vispeak: Visual instruction feedback in streaming videos.ICCV, 2025

    Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, and Wei-Shi Zheng. Vispeak: Visual instruction feedback in streaming videos.ICCV, 2025

  9. [9]

    Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

    Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

  10. [10]

    Online video understanding: Ovbench and videochat-online

    Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InCVPR, pages 3328–3338, 2025

  11. [11]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv:2410.21276, 2024

  12. [12]

    Colbert: Efficient and effective passage search via contextualized late interaction over bert

    Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020

  13. [13]

    Interaction methods for smart glasses: A survey.IEEE access, 6:28712–28732, 2018

    Lik-Hang Lee and Pan Hui. Interaction methods for smart glasses: A survey.IEEE access, 6:28712–28732, 2018

  14. [14]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  15. [15]

    StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024

  16. [16]

    Aligning cyber space with physical world: A comprehensive survey on embodied ai.IEEE/ASME Transactions on Mechatronics, 2025

    Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai.IEEE/ASME Transactions on Mechatronics, 2025

  17. [17]

    Thinking in streaming video.arXiv preprint arXiv:2603.12938, 2026

    Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, and Jing Liu. Thinking in streaming video.arXiv preprint arXiv:2603.12938, 2026

  18. [18]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, 2024

  19. [19]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large language models.arXiv preprint arXiv:2507.13334, 2025

  20. [20]

    Gated differentiable working memory for long-context language modeling.arXiv preprint arXiv:2601.12906, 2026

    Lingrui Mei, Shenghua Liu, Yiwei Wang, Yuyao Ge, Baolong Bi, Jiayu Yao, Jun Wan, Ziling Yin, Jiafeng Guo, and Xueqi Cheng. Gated differentiable working memory for long-context language modeling.arXiv preprint arXiv:2601.12906, 2026

  21. [21]

    LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

    Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269, 2025

  22. [22]

    Ovo-bench: How far is your video-llms from real-world online video understanding? In CVPR, pages 18902–18913, 2025

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In CVPR, pages 18902–18913, 2025

  23. [23]

    A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979

    Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979

  24. [24]

    Streaming long video understanding with large language models.NeurIPS, 37:119336–119360, 2024

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.NeurIPS, 37:119336–119360, 2024

  25. [25]

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InCVPR, 2025

  26. [26]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. In ICML, 2025

  27. [27]

    A Simple Baseline for Streaming Video Understanding

    Yujiao Shen, Shulin Tian, Jingkang Yang, and Ziwei Liu. A simple baseline for streaming video understanding.arXiv preprint arXiv:2604.02317, 2026

  28. [28]

    Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

  29. [29]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024

  30. [30]

    Streambridge: Turning your offline video large language model into a proactive streaming assistant.NeurIPS, 2025

    Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. Streambridge: Turning your offline video large language model into a proactive streaming assistant.NeurIPS, 2025

  31. [31]

    Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

    Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

  32. [32]

    To see is to believe: Prompting gpt-4v for better visual instruction tuning

    Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning.arXiv preprint arXiv:2311.07574, 2023

  33. [33]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  34. [34]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2.5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025

  35. [35]

    CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, and Yiwei Wang. Camreasoner: Reinforcing camera movement understanding via structured spatial reasoning.arXiv preprint arXiv:2602.00181, 2026

  36. [36]

    Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024

  37. [37]

    Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026

    Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026

  38. [38]

    StreamingVLM: Real-time Understanding for Infinite Video Streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025

  39. [39]

    Streammem: Query-agnostic kv cache memory for streaming video understanding.arXiv preprint arXiv:2508.15717, 2025

    Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding.arXiv preprint arXiv:2508.15717, 2025

  40. [40]

    Timechat-online: 80% visual tokens are naturally redundant in streaming videos

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. InACM MM, 2025

  41. [41]

    Streamforest: Efficient online video understanding with persistent event memory.NeurIPS, 2025

    Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, et al. Streamforest: Efficient online video understanding with persistent event memory. NeurIPS, 2025

  42. [42]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. EMNLP, 2023. URL https://arxiv.org/abs/2306.02858

  43. [43]

    Flash-vstream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. InICCV, 2025

  44. [44]

    HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

    Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, and Xipeng Qiu. Hermes: Kv cache as hierarchical memory for efficient streaming video understanding.arXiv preprint arXiv:2601.14724, 2026

  45. [45]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024

  46. [46]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  47. [47]

    Weavetime: Stream from earlier frames into emergent memory in videollms.arXiv preprint arXiv:2602.22142, 2026

    Yulin Zhang, Cheng Shi, and Sibei Yang. Weavetime: Stream from earlier frames into emergent memory in videollms. arXiv preprint arXiv:2602.22142, 2026
