arxiv: 2606.08615 · v1 · pith:443LYT3Pnew · submitted 2026-06-07 · 💻 cs.CV · cs.CL

Harnessing Streaming Video in the Wild

Pith reviewed 2026-06-27 18:46 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords streaming video · vision-language models · Streaming-Train-248K · Streaming Harness · Streaming-Eval · proactive interaction · long-term memory · real-time processing

The pith

A dataset, training objective, and plug-and-play harness adapt any vision-language model for proactive, memory-rich streaming video understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Streaming-Train-248K, a large dataset of streaming video paired with a new training objective, to teach VLMs to handle ongoing streams rather than offline clips. It then introduces Streaming Harness, a system added to any existing VLM backbone that supplies three abilities: deciding when to respond each second, retaining up to twelve hours of context, and delivering sub-second latency. A new benchmark called Streaming-Eval is provided to measure performance across diverse real-world streaming scenarios. Experiments show these changes produce consistent gains on the required streaming capabilities. The work supplies the data, code, and benchmark to support a shift toward deployable streaming systems.

Core claim

The authors construct the Streaming-Train-248K dataset and a novel training objective to adapt VLMs, wrap the adapted models in the Streaming Harness system (proactive per-second interaction, twelve-hour memory, and real-time sub-second processing), and evaluate on the Streaming-Eval benchmark. The claim is that existing VLMs thereby gain the core abilities needed for in-the-wild streaming video tasks.

What carries the argument

Streaming Harness, the plug-and-play system that endows any VLM with proactive interaction, long-term memory, and real-time processing.
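A minimal sketch of how a wrapper of this kind could be organized, assuming one memory entry per second and caller-supplied callables. The class name `StreamingLoop`, its fields, and the string "frame summaries" are hypothetical illustrations, not the paper's Streaming Harness API; only the three target behaviors (per-second response decisions, a 12-hour horizon, a sub-second budget) come from the abstract.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Optional, Sequence, Tuple

# 12-hour horizon at one entry per second: 12 * 60 * 60 = 43,200 entries.
MEMORY_SECONDS = 12 * 60 * 60
# "Sub-second" target taken from the abstract; the exact budget is an assumption.
LATENCY_BUDGET_S = 1.0


@dataclass
class StreamingLoop:
    """Hypothetical per-second loop around a VLM (not the paper's implementation)."""

    # Decides, once per second, whether a reply is warranted.
    should_respond: Callable[[str, Sequence[str]], bool]
    # Stand-in for a call into any VLM backbone.
    respond: Callable[[str, Sequence[str]], str]
    # Bounded memory: the oldest entry is dropped once 12 hours are stored.
    memory: Deque[str] = field(
        default_factory=lambda: deque(maxlen=MEMORY_SECONDS)
    )

    def step(self, frame_summary: str) -> Tuple[Optional[str], float]:
        """Process one second of input; return (reply or None, elapsed seconds)."""
        start = time.perf_counter()
        reply = None
        if self.should_respond(frame_summary, self.memory):
            reply = self.respond(frame_summary, self.memory)
        self.memory.append(frame_summary)
        return reply, time.perf_counter() - start
```

A caller would invoke `step` once per second and compare the returned elapsed time against `LATENCY_BUDGET_S`. A bounded deque is enough for this toy version; the abstract does not say how the paper's memory is actually stored or compressed.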

If this is right

  • VLMs gain the ability to make per-second decisions about when to respond during live streams.
  • Models retain context across up to twelve hours of continuous video input.
  • Processing achieves sub-second latency while handling unbounded streams.
  • Consistent performance gains appear across all measured streaming capabilities.
  • Open release of the dataset, code, and benchmark supports community development of streaming systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications such as live video-call assistants or embodied robots could run on unmodified base VLMs once the harness is attached.
  • The benchmark may expose limitations in offline-trained models that standard video benchmarks miss.
  • Widespread adoption of the harness could standardize evaluation practices for streaming video models.

Load-bearing premise

The Streaming-Train-248K dataset and novel training objective will adapt existing VLMs to in-the-wild streaming tasks without introducing unmeasured domain gaps or overfitting.

What would settle it

Running the base VLM and the harness-equipped version on Streaming-Eval and finding no consistent gains, or worse performance, on the proactive, memory, and latency metrics would falsify the central claim.
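The comparison described above can be sketched as a per-metric sign check. The function names, the metric names a caller would pass in, and the `lower_is_better` set are assumptions for illustration; the abstract does not name Streaming-Eval's metrics or report any numbers.

```python
from typing import Dict, FrozenSet


def gains_by_metric(
    base: Dict[str, float],
    harness: Dict[str, float],
    lower_is_better: FrozenSet[str] = frozenset(),
) -> Dict[str, bool]:
    """Return, per metric, whether the harness-equipped run beats the base run."""
    if base.keys() != harness.keys():
        raise ValueError("both runs must report the same metrics")
    improved = {}
    for name, base_score in base.items():
        delta = harness[name] - base_score
        # Latency-style metrics improve when they go down; the rest when they go up.
        improved[name] = delta < 0 if name in lower_is_better else delta > 0
    return improved


def claim_survives(improved: Dict[str, bool]) -> bool:
    """The 'consistent gains' claim needs every metric to improve."""
    return all(improved.values())
```

A tie or a regression on any single metric makes `claim_survives` return `False`, which matches the "no consistent gains, or worse performance" condition. A real check would also need variance estimates across runs, which this sketch omits.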

Original abstract

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper addresses limitations of existing VLMs in handling unbounded streaming video by introducing three contributions: (i) the Streaming-Train-248K dataset paired with a novel training objective to adapt VLMs for streaming interaction and understanding, (ii) the Streaming Harness, a plug-and-play system that adds proactive per-second response decisions, 12-hour context retention, and sub-second latency to any VLM backbone, and (iii) the Streaming-Eval benchmark for diverse in-the-wild streaming scenarios. The authors claim that extensive experiments demonstrate consistent gains across all core streaming capabilities, and they commit to open-sourcing the data, code, and benchmark.

Significance. If the reported gains hold under the described protocols, the work is significant for shifting the field from offline video understanding toward practical, deployable streaming systems in applications such as video-call assistants and embodied robots. A notable strength is the explicit commitment to open-source the dataset, code, and benchmark, which directly supports community progress. The plug-and-play nature of the harness, without requiring backbone modifications, offers a pragmatic engineering contribution if the ablations confirm robustness across model scales.

minor comments (3)
  1. [Abstract] The claims of 'consistent gains' and 'extensive experiments' would be more informative if the abstract included at least one concrete metric, baseline name, or effect size (e.g., accuracy delta on Streaming-Eval) rather than remaining entirely qualitative.
  2. The description of the novel training objective and the three abilities of the Streaming Harness should include explicit pseudocode or algorithmic steps in the main text (not only in supplementary material) to allow immediate reproduction by readers.
  3. Ensure that the experimental protocol section states the exact VLM backbones, model sizes, and training hyperparameters used for the reported gains, so that the 'consistent gains across all core capabilities' claim can be directly verified.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The review correctly identifies the core contributions and their potential impact on practical streaming video systems. We address the report below.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents engineering contributions consisting of a new dataset (Streaming-Train-248K), a plug-and-play system (Streaming Harness), and a benchmark (Streaming-Eval), followed by empirical evaluations showing gains. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the construction and testing of these independent artifacts rather than any reduction to prior inputs by definition or self-reference. This is the expected non-finding for a systems-oriented paper without a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all contributions are described as constructed datasets, systems, and benchmarks without additional postulated mechanisms.

pith-pipeline@v0.9.1-grok · 5805 in / 1064 out tokens · 24373 ms · 2026-06-27T18:46:28.551271+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding

    cs.CV 2026-06 unverdicted novelty 6.0

    Fits a model where logit-accuracy scales linearly in log frame budget B with distance-dependent exponent α(D) that decays log-linearly with temporal distance D, based on 155k binary predictions across ten models.
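    One way to write the functional form described above, as a sketch only: the coefficients c, a, and b are placeholder symbols introduced here, and the exact parameterization in the cited paper may differ.

```latex
\operatorname{logit}\,\mathrm{acc}(B, D) = c + \alpha(D)\,\log B,
\qquad
\alpha(D) = a - b\,\log D, \quad b > 0 .
```

    Under this reading, the odds of a correct prediction scale as B raised to the power α(D), which is why α(D) acts as an exponent, and the benefit of a larger frame budget shrinks as the temporal distance D grows.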

Appendix A: Limitations

Our current framework operates in a vision-language model setting and does not incorporate the audio modality. Integrating an omni-modal large model with speech understanding capa...