Recognition: 2 Lean theorem links
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
Pith reviewed 2026-05-11 02:09 UTC · model grok-4.3
The pith
Semantic signals from a fixed question bank let models retain relevant frames in streaming videos and retrieve them adaptively, improving accuracy while roughly halving peak GPU memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAVEMem builds a three-tier streaming memory online under a constant budget, where a fixed pseudo-question bank supplies semantic salience to guide long-term retention instead of visual similarity alone. A second stage then applies query-aware retrieval: an anchor-conditioned recency gate adapts the scope across the short-, mid-, and long-term tiers, and late interaction between query and memory tokens selects the candidate frames.
What carries the argument
Dual-stage SAVEMem: a semantic prior from a fixed pseudo-question bank drives three-tier memory generation, and an anchor-conditioned recency gate sets a query-adaptive retrieval scope.
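A minimal sketch of how Stage 1 could work, assuming frame tokens and pseudo-questions are already embedded in a shared space. The tier capacities, the eviction order, and the frame-level aggregation (mean over per-token MaxSim) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def semantic_salience(frame_tokens, question_bank):
    """Late-interaction MaxSim: each visual token takes its best cosine match
    against the fixed pseudo-question bank; the frame score averages over tokens."""
    f = frame_tokens / np.linalg.norm(frame_tokens, axis=-1, keepdims=True)
    q = question_bank / np.linalg.norm(question_bank, axis=-1, keepdims=True)
    sims = f @ q.T                                     # (num_tokens, num_questions)
    return float(sims.max(axis=1).mean())              # s(v) = max_q cos(v, q), averaged over tokens

class ThreeTierMemory:
    """Constant-budget streaming memory: short-term holds the newest frames,
    mid-term holds recent evictions, long-term keeps only the most salient frames."""
    def __init__(self, question_bank, short_cap=8, mid_cap=16, long_cap=32):
        self.question_bank = question_bank
        self.caps = (short_cap, mid_cap, long_cap)
        self.short, self.mid, self.long = [], [], []

    def add(self, frame_tokens):
        score = semantic_salience(frame_tokens, self.question_bank)
        self.short.append((score, frame_tokens))
        if len(self.short) > self.caps[0]:              # oldest short-term frame moves down a tier
            self.mid.append(self.short.pop(0))
        if len(self.mid) > self.caps[1]:                # oldest mid-term frame competes for long-term
            self.long.append(self.mid.pop(0))
            self.long.sort(key=lambda item: item[0], reverse=True)
            self.long = self.long[: self.caps[2]]       # retention by semantic salience, not recency
```

The point of the sketch is the retention rule in the last three lines: long-term survival depends on salience against the question bank, while the total footprint stays bounded by the three fixed capacities.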
If this is right
- Applied without training, SAVEMem raises the OVO-Bench overall score from 52.27 to 62.69 on Qwen2.5-VL.
- Consistent gains appear on StreamingBench and ODV-Bench under the same zero-training setting.
- Peak GPU memory at 128 frames falls by 48 percent relative to the unmodified backbone.
- The three-tier memory plus adaptive retrieval coordinates compression and retrieval in one pipeline rather than treating them separately.
Where Pith is reading between the lines
- The same semantic-prior approach could be tested on other vision-language backbones to check whether the gains transfer without retraining.
- Varying the size or content of the pseudo-question bank might reveal how much semantic coverage is needed for different video domains.
- Because retrieval scope now adapts to query timing, the method may reduce unnecessary token loading in very long streams where most queries target recent frames.
Load-bearing premise
A fixed pseudo-question bank supplies a lightweight yet effective semantic prior that shapes long-term retention decisions better than visual similarity alone.
What would settle it
Removing the pseudo-question bank from Stage 1 or the recency gate from Stage 2 and measuring whether OVO-Bench, StreamingBench, and ODV-Bench scores drop back to or below the Qwen2.5-VL baseline at the same memory budget.
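A hypothetical hook for the controlled swap this test (and the referee report below) calls for: replace only the retention score with a visual-similarity baseline, keeping the tier structure, budget, and Stage 2 fixed. The function name and the specific novelty heuristic are assumptions; the paper's visual-similarity baselines may differ.

```python
import numpy as np

def visual_similarity_salience(frame_tokens, prev_frame_tokens):
    """Visual-only retention score: a frame counts as salient when it is least
    similar to the previously retained frame (pure novelty, no semantic signal)."""
    f = frame_tokens / np.linalg.norm(frame_tokens, axis=-1, keepdims=True)
    p = prev_frame_tokens / np.linalg.norm(prev_frame_tokens, axis=-1, keepdims=True)
    return 1.0 - float((f @ p.T).max(axis=1).mean())
```

Dropping this score into the Stage 1 sketch above in place of semantic_salience, with everything else frozen, would isolate the contribution of the semantic prior.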
Original abstract
Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
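A minimal sketch of Stage 2, continuing the Stage 1 sketch earlier on this page. The anchor heuristic (a scalar in [0, 1] estimating how recent the queried moment is), the gate thresholds, and the top-k value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def recency_gate(query_anchor):
    """Anchor-conditioned scope: present-focused queries stay in short-term memory,
    past-focused queries widen the search to the mid- and long-term tiers."""
    if query_anchor > 0.8:
        return ("short",)
    if query_anchor > 0.4:
        return ("short", "mid")
    return ("short", "mid", "long")

def retrieve(query_tokens, memory, query_anchor, k=4):
    """Late interaction within the gated scope: ColBERT-style MaxSim between query
    tokens and each stored frame's tokens selects the top-k candidate frames."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=-1, keepdims=True)
    scored = []
    for tier in recency_gate(query_anchor):
        for _, frame_tokens in getattr(memory, tier):   # tiers from the ThreeTierMemory sketch
            f = frame_tokens / np.linalg.norm(frame_tokens, axis=-1, keepdims=True)
            score = float((q @ f.T).max(axis=1).sum())  # best match per query token, summed
            scored.append((score, frame_tokens))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [frame for _, frame in scored[:k]]
```

The selected frames, together with the query, would then go to the unmodified backbone (Qwen2.5-VL in the reported experiments) for answering.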
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SAVEMem, a training-free dual-stage framework for streaming video understanding. In Stage 1, it builds a three-tier streaming memory online under constant budget, using a fixed pseudo-question bank to shape long-term retention by semantic salience rather than visual similarity alone. In Stage 2, it performs query-aware retrieval with an anchor-conditioned recency gate that adapts scope from short- to long-term memory, followed by late interaction for frame selection. Applied to Qwen2.5-VL, it reports OVO-Bench improvement from 52.27 to 62.69, consistent gains on StreamingBench and ODV-Bench, and 48% peak GPU memory reduction at 128 frames.
Significance. If the results hold under proper controls, this would represent a meaningful advance in efficient, training-free memory management for vision-language models handling unbounded video streams. The training-free application to an existing backbone, the explicit constant-budget constraint, and the reported memory savings are clear strengths. Consistent cross-benchmark gains suggest practical relevance for real-time streaming tasks.
major comments (3)
- [Abstract / Stage 1] The central claim attributes the +10.42 OVO-Bench gain (and consistent lifts elsewhere) to the semantic prior from the fixed pseudo-question bank outperforming visual-similarity heuristics. However, no ablation is described that swaps only this bank for a pure visual baseline while freezing the three-tier structure, constant budget, and Stage 2 components. Without this isolation, the semantic-awareness contribution cannot be confirmed as the causal driver.
- [Abstract / Experiments] The headline numeric improvements are presented without details on experimental controls, baseline re-implementations, statistical significance, variance across runs, or checks for data-selection effects. This leaves support for the reported gains provisional and makes it difficult to assess whether they generalize beyond the specific evaluation setup.
- [Stage 1] No construction details are supplied for the pseudo-question bank (size, source, curation process, or cross-benchmark generality). If the bank is derived from or tuned to the evaluated benchmarks (OVO-Bench, StreamingBench, ODV-Bench), the semantic prior risks circularity, and the claim that it supplies an independent lightweight semantic signal is weakened.
minor comments (2)
- [Abstract] The phrase 'consistent gains' is used without reporting the per-benchmark deltas or absolute scores; adding these numbers would improve precision and let readers gauge effect sizes directly.
- [Abstract] Notation and terminology: The terms 'three-tier streaming memory' and 'anchor-conditioned recency gate' are introduced without a brief inline gloss in the abstract; a short parenthetical definition on first use would aid accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the current manuscript and committing to specific revisions that strengthen the presentation of SAVEMem without altering its core claims.
Point-by-point responses
- Referee: [Abstract / Stage 1] The central claim attributes the +10.42 OVO-Bench gain (and consistent lifts elsewhere) to the semantic prior from the fixed pseudo-question bank outperforming visual-similarity heuristics. However, no ablation is described that swaps only this bank for a pure visual baseline while freezing the three-tier structure, constant budget, and Stage 2 components. Without this isolation, the semantic-awareness contribution cannot be confirmed as the causal driver.
Authors: We agree that a controlled ablation replacing only the pseudo-question bank with a pure visual-similarity retention policy (while freezing the three-tier memory structure, constant budget, and all Stage 2 components) would provide the cleanest isolation of the semantic prior's contribution. The current manuscript demonstrates gains over existing visual-heuristic methods but does not include this exact swap. We will add this ablation in the revised version to directly support the causal attribution. revision: yes
- Referee: [Abstract / Experiments] The headline numeric improvements are presented without details on experimental controls, baseline re-implementations, statistical significance, variance across runs, or checks for data-selection effects. This leaves support for the reported gains provisional and makes it difficult to assess whether they generalize beyond the specific evaluation setup.
Authors: We acknowledge the need for greater transparency on experimental rigor. The reported results follow the standard protocols of OVO-Bench, StreamingBench, and ODV-Bench, and the method is fully deterministic given fixed inputs and the fixed pseudo-question bank. In the revision we will expand the Experiments section with explicit details on baseline re-implementations, any statistical significance checks performed, run-to-run variance (expected to be zero), and explicit checks confirming that gains are not driven by data-selection artifacts. revision: yes
- Referee: [Stage 1] No construction details are supplied for the pseudo-question bank (size, source, curation process, or cross-benchmark generality). If the bank is derived from or tuned to the evaluated benchmarks (OVO-Bench, StreamingBench, ODV-Bench), the semantic prior risks circularity, and the claim that it supplies an independent lightweight semantic signal is weakened.
Authors: The pseudo-question bank is a fixed, benchmark-agnostic collection of generic questions intended to capture common semantic aspects of video streams. We will add a new subsection in the revised Stage 1 description that fully specifies its size, source, curation process, and evidence of generality across benchmarks, thereby confirming that it functions as an independent lightweight semantic prior and eliminating any appearance of circularity. revision: yes
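The rebuttal's description of the bank as a fixed, benchmark-agnostic collection of generic questions suggests something like the following. This is a hypothetical illustration only: the first entry echoes the example the paper itself gives ('What objects are visible?'), while the remaining entries and the bank size are invented here.

```python
# Hypothetical pseudo-question bank, fixed at system initialization and shared
# across all videos and queries; contents and size are assumptions.
PSEUDO_QUESTION_BANK = [
    "What objects are visible in the scene?",
    "What action is being performed?",
    "Where does the scene take place?",
    "What has changed since the previous moment?",
]
```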
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper describes a training-free dual-stage framework applied directly to the pre-existing Qwen2.5-VL backbone. Reported gains on OVO-Bench, StreamingBench, and ODV-Bench are presented as empirical outcomes of the method rather than quantities derived by construction from fitted parameters or self-referential definitions. No equations, self-citations, or procedures in the abstract or described pipeline reduce the central claims (semantic prior via fixed pseudo-question bank, adaptive retrieval) to inputs by definition. The method is explicitly positioned as not requiring training or benchmark-specific fitting, satisfying the criteria for a self-contained, non-circular presentation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A fixed pseudo-question bank provides a sufficient semantic prior for shaping long-term memory retention
invented entities (2)
- three-tier streaming memory (no independent evidence)
- anchor-conditioned recency gate (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · match: unclear
Linked passage: "Stage 1 builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear
Linked passage: "we score each visual token v against a fixed pseudo-question bank Q via late-interaction MaxSim: s(v) = max_{q in Q} cos(v, q)"
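The quoted scoring rule, restated in display form. Only the per-token MaxSim on the left is quoted from the paper; the per-frame aggregation on the right is an assumed average added for readability.

```latex
\[
  s(v) \;=\; \max_{q \in Q} \cos(v, q),
  \qquad
  S(f) \;=\; \frac{1}{|V_f|} \sum_{v \in V_f} s(v)
\]
```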
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Token merging: Your ViT but faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. InICLR, 2023
work page 2023
-
[3]
Videollm-online: Online video large language model for streaming video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InCVPR, 2024
work page 2024
-
[4]
Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024
work page 2024
-
[5]
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024
work page 2024
-
[6]
Streaming video question-answering with in-context video kv-cache retrieval
Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang, et al. Streaming video question-answering with in-context video kv-cache retrieval. InICLR, 2025
work page 2025
-
[7]
Contextnav: Towards agentic multimodal in-context learning.arXiv preprint arXiv:2510.04560, 2025
Honghao Fu, Yuan Ouyang, Kai-Wei Chang, Yiwei Wang, Zi Huang, and Yujun Cai. Contextnav: Towards agentic multimodal in-context learning.arXiv preprint arXiv:2510.04560, 2025
-
[8]
Vispeak: Visual instruction feedback in streaming videos.ICCV, 2025
Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, and Wei-Shi Zheng. Vispeak: Visual instruction feedback in streaming videos.ICCV, 2025
work page 2025
-
[9]
Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025
-
[10]
Online video understanding: Ovbench and videochat-online
Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: Ovbench and videochat-online. InCVPR, pages 3328–3338, 2025
work page 2025
-
[11]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Colbert: Efficient and effective passage search via contextualized late interaction over bert
Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020
work page 2020
-
[13]
Interaction methods for smart glasses: A survey.IEEE access, 6:28712–28732, 2018
Lik-Hang Lee and Pan Hui. Interaction methods for smart glasses: A survey.IEEE access, 6:28712–28732, 2018
work page 2018
-
[14]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Streamingbench: Assessing the gap for mllms to achieve streaming video understanding
Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024
-
[16]
Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai.IEEE/ASME Transactions on Mechatronics, 2025
work page 2025
-
[17]
Thinking in streaming video.arXiv preprint arXiv:2603.12938, 2026
Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, and Jing Liu. Thinking in streaming video.arXiv preprint arXiv:2603.12938, 2026
-
[18]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, 2024
work page 2024
-
[19]
A Survey of Context Engineering for Large Language Models
Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large language models.arXiv preprint arXiv:2507.13334, 2025
work page internal anchor Pith review arXiv 2025
-
[20]
Lingrui Mei, Shenghua Liu, Yiwei Wang, Yuyao Ge, Baolong Bi, Jiayu Yao, Jun Wan, Ziling Yin, Jiafeng Guo, and Xueqi Cheng. Gated differentiable working memory for long-context language modeling.arXiv preprint arXiv:2601.12906, 2026
-
[21]
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. Livevlm: Efficient online video understanding via streaming-oriented kv cache and retrieval. arXiv preprint arXiv:2505.15269, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In CVPR, pages 18902–18913, 2025
work page 2025
-
[23]
Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979
work page 1979
-
[24]
Streaming long video understanding with large language models.NeurIPS, 37:119336–119360, 2024
Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models.NeurIPS, 37:119336–119360, 2024
work page 2024
-
[25]
Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. InCVPR, 2025
work page 2025
-
[26]
Longvu: Spatiotemporal adaptive compression for long video-language understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. In ICML, 2025
work page 2025
-
[27]
A Simple Baseline for Streaming Video Understanding
Yujiao Shen, Shulin Tian, Jingkang Yang, and Ziwei Liu. A simple baseline for streaming video understanding.arXiv preprint arXiv:2604.02317, 2026
-
[28]
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025
work page 2025
-
[29]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. Streambridge: Turning your offline video large language model into a proactive streaming assistant.NeurIPS, 2025
work page 2025
-
[31]
Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023
-
[32]
To see is to believe: Prompting gpt-4v for better visual instruction tuning
Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning.arXiv preprint arXiv:2311.07574, 2023
-
[33]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2.5: Empowering video mllms with long and rich context modeling.arXiv preprint arXiv:2501.12386, 2025
-
[35]
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, and Yiwei Wang. Camreasoner: Reinforcing camera movement understanding via structured spatial reasoning.arXiv preprint arXiv:2602.00181, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[36]
Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024
work page 2024
-
[37]
Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding.arXiv preprint arXiv:2603.02096, 2026
-
[38]
Streamingvlm: Real-time understanding for infinite video streams
Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025
-
[39]
Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. Streammem: Query-agnostic kv cache memory for streaming video understanding.arXiv preprint arXiv:2508.15717, 2025
-
[40]
Timechat-online: 80% visual tokens are naturally redundant in streaming videos
Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. InACM MM, 2025
work page 2025
-
[41]
Streamforest: Efficient online video understanding with persistent event memory.NeurIPS, 2025
Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, et al. Streamforest: Efficient online video understanding with persistent event memory. NeurIPS, 2025
work page 2025
-
[42]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. EMNLP, 2023. URL https://arxiv.org/abs/2306.02858
work page internal anchor Pith review arXiv 2023
-
[43]
Flash-vstream: Memory-based real-time understanding for long video streams
Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. InICCV, 2025
work page 2025
-
[44]
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, and Xipeng Qiu. Hermes: Kv cache as hierarchical memory for efficient streaming video understanding.arXiv preprint arXiv:2601.14724, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[45]
Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024
work page internal anchor Pith review arXiv 2024
-
[46]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Yulin Zhang, Cheng Shi, and Sibei Yang. Weavetime: Stream from earlier frames into emergent memory in videollms. arXiv preprint arXiv:2602.22142, 2026
discussion (0)