Harnessing Streaming Video in the Wild
Pith reviewed 2026-06-27 18:46 UTC · model grok-4.3
The pith
A dataset, training objective, and plug-and-play harness adapt any vision-language model for proactive, memory-rich streaming video understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper constructs the Streaming-Train-248K dataset and a novel training objective to adapt VLMs, then wraps the adapted models in the Streaming Harness system, which supplies proactive per-second interaction, twelve-hour memory, and sub-second real-time processing. Evaluated on the Streaming-Eval benchmark, existing VLMs are claimed to gain the core abilities needed for in-the-wild streaming video tasks.
What carries the argument
Streaming Harness, the plug-and-play system that endows any VLM with proactive interaction, long-term memory, and real-time processing.
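To make the plug-and-play idea concrete, here is a minimal sketch of how a wrapper could sit around an arbitrary backbone. It is not the paper's implementation: `HarnessSketch`, `toy_backbone`, and the string "frames" are hypothetical stand-ins, and sub-second latency is not modeled. Only the once-per-second decision and the 12-hour retention window come from the source.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Tuple

# Hypothetical interface: a backbone sees (remembered frames, current frame)
# and returns a reply, or None to stay silent.
Backbone = Callable[[List[str], str], Optional[str]]


@dataclass
class HarnessSketch:
    backbone: Backbone
    horizon_s: int = 12 * 3600  # 12-hour retention window, as stated in the abstract
    memory: Deque[Tuple[int, str]] = field(default_factory=deque)

    def step(self, t: int, frame: str) -> Optional[str]:
        """Handle one second of video; return a reply or None (no response)."""
        # Long-term memory: evict entries that fall outside the retention window.
        while self.memory and t - self.memory[0][0] >= self.horizon_s:
            self.memory.popleft()
        context = [f for _, f in self.memory]
        # Proactive interaction: the backbone decides every second whether to speak.
        reply = self.backbone(context, frame)
        self.memory.append((t, frame))
        return reply


def toy_backbone(context: List[str], frame: str) -> Optional[str]:
    # Toy policy (an assumption): speak only when the scene changes.
    if context and context[-1] == frame:
        return None
    return f"now seeing: {frame}"


harness = HarnessSketch(toy_backbone, horizon_s=3)  # tiny window for the demo
frames = ["door", "door", "cat", "cat", "cat", "door"]
replies = [harness.step(t, f) for t, f in enumerate(frames)]
print(replies)
```

The backbone is passed in as a plain callable, which is one way to read "plug-and-play": swapping models would not require touching the loop. At one decision per second, a 12-hour stream amounts to 12 × 3600 = 43,200 calls to `step`.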
If this is right
- VLMs gain the ability to make per-second decisions about when to respond during live streams.
- Models retain context across up to twelve hours of continuous video input.
- Processing achieves sub-second latency while handling unbounded streams.
- Consistent performance gains appear across all measured streaming capabilities.
- Open release of the dataset, code, and benchmark supports community development of streaming systems.
Where Pith is reading between the lines
- Applications such as live video-call assistants or embodied robots could run on unmodified base VLMs once the harness is attached.
- The benchmark may expose limitations in offline-trained models that standard video benchmarks miss.
- Widespread adoption of the harness could standardize evaluation practices for streaming video models.
Load-bearing premise
The Streaming-Train-248K dataset and novel training objective will adapt existing VLMs to in-the-wild streaming tasks without introducing unmeasured domain gaps or overfitting.
What would settle it
Running the base VLM and the harness-equipped version on Streaming-Eval and finding no consistent gains, or worse performance, on the proactive, memory, and latency metrics would falsify the central claim.
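The falsification test above can be sketched as a simple comparison. The metric names and numbers below are illustrative placeholders, not results from the paper, and treating latency as lower-is-better is an assumption about how Streaming-Eval scores it.

```python
from typing import Dict, Iterable


def consistent_gain(
    base: Dict[str, float],
    harnessed: Dict[str, float],
    lower_is_better: Iterable[str] = ("latency_s",),
) -> bool:
    """Return True only if the harnessed model improves on every metric."""
    lower = set(lower_is_better)
    for name, base_score in base.items():
        new_score = harnessed[name]
        improved = new_score < base_score if name in lower else new_score > base_score
        if not improved:
            return False  # a single flat or worse metric breaks "consistent gains"
    return True


# Placeholder scores for illustration only.
base = {"proactive": 0.40, "memory": 0.50, "latency_s": 2.0}
better = {"proactive": 0.55, "memory": 0.60, "latency_s": 0.8}
flat = {"proactive": 0.40, "memory": 0.60, "latency_s": 0.8}

print(consistent_gain(base, better))  # every metric improves
print(consistent_gain(base, flat))    # "proactive" does not improve
```

A stricter version would also require the gain to exceed run-to-run variance, which this sketch ignores.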
Original abstract
Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses limitations of existing VLMs in handling unbounded streaming video by introducing three contributions: (i) the Streaming-Train-248K dataset paired with a novel training objective to adapt VLMs for streaming interaction and understanding, (ii) the Streaming Harness, a plug-and-play system that adds proactive per-second response decisions, 12-hour context retention, and sub-second latency to any VLM backbone, and (iii) the Streaming-Eval benchmark for diverse in-the-wild streaming scenarios. The authors claim that extensive experiments demonstrate consistent gains across all core streaming capabilities, and they commit to open-sourcing the data, code, and benchmark.
Significance. If the reported gains hold under the described protocols, the work is significant for shifting the field from offline video understanding toward practical, deployable streaming systems in applications such as video-call assistants and embodied robots. A notable strength is the explicit commitment to open-source the dataset, code, and benchmark, which directly supports community progress. The plug-and-play nature of the harness, without requiring backbone modifications, offers a pragmatic engineering contribution if the ablations confirm robustness across model scales.
minor comments (3)
- [Abstract] The claims of 'consistent gains' and 'extensive experiments' would be more informative if the abstract included at least one concrete metric, baseline name, or effect size (e.g., accuracy delta on Streaming-Eval) rather than remaining entirely qualitative.
- The description of the novel training objective and the three abilities of the Streaming Harness should include explicit pseudocode or algorithmic steps in the main text (not only in supplementary material) to allow immediate reproduction by readers.
- Ensure that the experimental protocol section states the exact VLM backbones, model sizes, and training hyperparameters used for the reported gains, so that the 'consistent gains across all core capabilities' claim can be directly verified.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. The review correctly identifies the core contributions and their potential impact on practical streaming video systems. We address the report below.
Circularity Check
No significant circularity
The paper presents engineering contributions consisting of a new dataset (Streaming-Train-248K), a plug-and-play system (Streaming Harness), and a benchmark (Streaming-Eval), followed by empirical evaluations showing gains. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on the construction and testing of these independent artifacts rather than any reduction to prior inputs by definition or self-reference. This is the expected non-finding for a systems-oriented paper without a derivation chain.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding
Fits a model where logit-accuracy scales linearly in log frame budget B with distance-dependent exponent α(D) that decays log-linearly with temporal distance D, based on 155k binary predictions across ten models.
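Read literally, this one-line summary suggests a functional form along the following lines. The intercept c, the coefficients a and b, and the choice of natural logarithm are assumptions introduced here for illustration; the summary itself fixes only the linear dependence on log B and the log-linear decay of α(D).

```latex
\operatorname{logit}\bigl(\mathrm{acc}(B, D)\bigr) = c + \alpha(D)\,\log B,
\qquad
\alpha(D) = a - b\,\log D, \quad b > 0
```

Equivalently, the odds of a correct prediction would scale as $B^{\alpha(D)}$, which is why α(D) can be called an exponent: a larger frame budget helps less as the temporal distance D to the queried content grows.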
Appendix A: Limitations
Our current framework operates in a vision-language model setting and does not incorporate the audio modality. Integrating an omni-modal large model with speech understanding capa...