pith. machine review for the scientific record.

arxiv: 2604.07634 · v2 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords VSAS-Bench · visual streaming assistants · vision-language models · proactiveness · consistency · real-time evaluation · benchmark · streaming video

The pith

Conventional vision-language models adapted for streaming outperform specialized streaming models without any extra training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VSAS-Bench to evaluate streaming vision-language models that must generate responses while receiving a continuous stream of video frames. Standard benchmarks test models only on complete videos after the fact, but real-time assistants also need timely responses and stable answers as new frames arrive. VSAS-Bench supplies more than 18,000 temporally dense annotations across many domains and task types, plus fixed synchronous and asynchronous protocols that measure proactiveness and consistency in addition to accuracy. Large-scale tests of recent models identify clear accuracy-latency trade-offs tied to memory buffer size, access policy, and input resolution. The central result is that ordinary VLMs can be run in streaming mode without retraining and still surpass recent models built specifically for streaming, such as a 3 percent edge for Qwen3-VL-4B over Dispider under the asynchronous protocol.
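The asynchronous protocol is the crux of that result: frames keep arriving at camera rate whether or not the model has finished its previous call, so slower models answer about staler frames. The snippet below is a minimal sketch of that dynamic under stated assumptions; the function names, the latest-frame access policy, and all numbers are illustrative, not the benchmark's implementation.

```python
# Minimal sketch of the asynchronous setting: the camera emits frames at a fixed
# rate regardless of model speed, and the model only sees whatever is in a bounded
# buffer when its previous call finishes. Slower models therefore answer about
# staler frames, which is the behavior the proactiveness measurements penalize.
from collections import deque

def simulate_async(stream_fps, infer_seconds, total_seconds, buffer_len):
    """Return (response_time, frame_index_answered) pairs for one simulated stream."""
    frame_interval = 1.0 / stream_fps
    buffer = deque(maxlen=buffer_len)       # oldest frames are silently dropped
    responses = []
    next_free = 0.0                         # time at which the model finishes its current call
    for i in range(int(total_seconds * stream_fps)):
        t = i * frame_interval              # camera timestamp of frame i
        buffer.append(i)
        if t >= next_free:                  # model is idle: start a new inference
            latest = buffer[-1]             # a "most recent frame" access policy
            next_free = t + infer_seconds
            responses.append((next_free, latest))
    return responses

# A fast model answers about nearly current frames; a slow one lags the stream.
for label, latency in [("fast (0.2 s/call)", 0.2), ("slow (2.0 s/call)", 2.0)]:
    out = simulate_async(stream_fps=2.0, infer_seconds=latency,
                         total_seconds=10.0, buffer_len=8)
    lags = [resp_t - frame_idx / 2.0 for resp_t, frame_idx in out]
    print(f"{label}: {len(out)} responses, mean lag {sum(lags) / len(lags):.2f}s")
```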

Core claim

VSAS-Bench supplies temporally dense annotations and standardized synchronous and asynchronous protocols that isolate proactiveness and consistency alongside accuracy. When this framework is applied to current video and streaming VLMs, conventional models adapted to streaming settings without additional training achieve higher performance than recent dedicated streaming VLMs.

What carries the argument

VSAS-Bench, a benchmark that supplies temporally dense annotations, synchronous and asynchronous evaluation protocols, and separate metrics for proactiveness and consistency to test real-time streaming vision-language models.
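To make "separate metrics for proactiveness and consistency" concrete, here is one plausible reading of how such scores could be computed from timestamped responses and temporally dense ground-truth windows. The paper defines its own formulas; the data layout and scoring rules below are assumptions for illustration only.

```python
# Illustrative only: proactiveness- and consistency-style scores computed from
# (t_start, t_end, answer) ground-truth windows and (timestamp, answer) responses.
# These are not VSAS-Bench's definitions.

def proactiveness(events, responses):
    """Fraction of ground-truth events answered correctly before their window closes."""
    hits = sum(
        any(start <= t <= end and ans == truth for t, ans in responses)
        for start, end, truth in events
    )
    return hits / len(events) if events else 0.0

def consistency(events, responses):
    """Within each event window, fraction of consecutive responses that agree."""
    scores = []
    for start, end, _ in events:
        in_window = [ans for t, ans in responses if start <= t <= end]
        if len(in_window) >= 2:
            agree = sum(a == b for a, b in zip(in_window, in_window[1:]))
            scores.append(agree / (len(in_window) - 1))
    return sum(scores) / len(scores) if scores else 1.0

events = [(0.0, 3.0, "pouring water"), (3.5, 6.0, "stirring")]
responses = [(1.2, "pouring water"), (2.4, "pouring water"), (4.5, "stirring")]
print(proactiveness(events, responses), consistency(events, responses))
```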

If this is right

  • Performance depends on concrete design choices such as memory buffer length, memory access policy, and input resolution, with measurable accuracy-latency trade-offs.
  • The benchmark supports standardized comparisons across both video VLMs and streaming VLMs using the new metrics.
  • Adapted conventional models establish a stronger baseline for streaming performance than current purpose-built streaming models.
  • Practical configuration guidelines emerge for balancing speed and quality in deployed streaming assistants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests streaming behavior may arise naturally from general video-language training rather than needing separate architectures or data.
  • The same protocols and metrics could be reused to evaluate streaming models in other modalities such as audio or sensor streams.
  • Teams building live assistants may achieve faster progress by adapting existing models instead of training new streaming-specific ones from scratch.

Load-bearing premise

The new metrics for proactiveness and consistency, together with the temporally dense annotations and the chosen evaluation protocols, validly capture the capabilities required for real-time visual streaming assistants.

What would settle it

An independent test on a new set of live video streams with fresh human ratings of response timeliness and consistency in which the adapted conventional models no longer outperform the specialized streaming models.

Figures

Figures reproduced from arXiv: 2604.07634 by Bo Feng, Cem Koc, Chun-Liang Li, Fartash Faghri, Hadi Pouransari, Meng Cao, Oncel Tuzel, Pavan Kumar Anasosalu Vasu, Zhengfeng Lai.

Figure 1
Figure 1: VSAS-BENCH accuracy uncovers advantage of small models as visual streaming assistants. Under the synchronous evaluation protocol, models run in lockstep with the camera and larger models benefit from unlimited processing time. However, in the more realistic asynchronous evaluation, smaller models outperform because their high inference speed allows them to respond more rapidly and effectively to the live … view at source ↗
Figure 2
Figure 2: VSAS-BENCH comprises densely annotated videos with a wide range of actions, varying durations, and reasoning horizons. (a) Distribution of video categories, showing the diversity of tasks covered. (b) Mean video duration (with standard deviation bars) for each video category. (c) Distribution of task types, highlighting the split between short- and long-horizon temporal reasoning. … view at source ↗
Figure 3
Figure 3: Examples of VSAS-BENCH task types with frame-level annotations. VSAS-BENCH involves three task types: Present tasks, which focus on currently occurring events; Cumulative tasks, which require the model to reason over past events; and Future tasks, which focus on predicting upcoming events based on ongoing visual cues. … view at source ↗
Figure 4
Figure 4: Traditional versus realistic evaluation protocols. (top) Synchronous Protocol: Frame generation is temporally aligned with the VLM’s processing rate, ensuring camera input and model inference occur in lockstep. (bottom) Asynchronous Protocol: Frame acquisition is decoupled from inference; the VLM retrieves frames from a buffered queue, emulating real-world streaming on edge devices. … view at source ↗
Figure 5
Figure 5: Overview of self-speculative decoding for streaming VLMs. The prior response serves as a draft to be verified for the next frame. Accompanying table (Cumulative task, OCR): without self-speculative decoding, mean avg. accuracy 21.5, consistency 93.0, latency 5.8 s; with it, accuracy 55.1, consistency 96.2, latency 1.5 s. A sketch of the idea follows the figure list. view at source ↗
Figure 6
Figure 6: Judge prompt used to evaluate accuracy of model response. view at source ↗
Figure 7
Figure 7: Examples of VSAS-BENCH task types with frame-level annotations and full prompts. VSAS-BENCH involves three task types: Present tasks, which focus on currently occurring events; Cumulative tasks, which require the model to reason over past events; and Future tasks, which focus on predicting upcoming events based on ongoing visual cues. view at source ↗
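Figure 5's self-speculative decoding is simple to state in code: the previous frame's response is replayed as a draft, kept only as far as the model would still greedily produce it on the new frame, and decoding then continues normally. The sketch below uses a toy stand-in model and a plain greedy acceptance rule; it illustrates the idea and is not the paper's implementation.

```python
# Sketch of self-speculative decoding for streaming VLMs: reuse the prior response
# as a draft and keep the longest prefix the model would still produce on the new
# frame. `toy_model` and the acceptance rule are stand-ins, not the paper's code.

def greedy_decode(next_token, prefix, max_new=20, eos="<eos>"):
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        tok = next_token(out)
        if tok == eos:
            break
        out.append(tok)
    return out[len(prefix):]

def self_speculative_decode(next_token, prompt, draft, max_new=20):
    accepted = []
    for tok in draft:                                   # verify draft tokens in order
        if next_token(prompt + accepted) == tok:
            accepted.append(tok)
        else:
            break                                       # first mismatch: stop reusing the draft
    fresh = greedy_decode(next_token, prompt + accepted, max_new=max_new - len(accepted))
    return accepted + fresh, len(accepted)

# Toy "model": its answer depends only on which frame token appears in the prompt.
ANSWERS = {"frame_t": ["the", "kettle", "is", "boiling", "<eos>"],
           "frame_t+1": ["the", "kettle", "is", "whistling", "<eos>"]}

def toy_model(tokens):
    frame = next(t for t in tokens if t in ANSWERS)
    said = tokens[tokens.index(frame) + 1:]
    answer = ANSWERS[frame]
    return answer[len(said)] if len(said) < len(answer) else "<eos>"

prev = greedy_decode(toy_model, ["frame_t"])                       # full decode on frame t
new, reused = self_speculative_decode(toy_model, ["frame_t+1"], prev)
print(prev, new, f"(reused {reused} draft tokens)")
```

With a nearly unchanged scene the draft is accepted almost in full, which is how the setup in Figure 5 cuts latency while improving consistency.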
read the original abstract

Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VSAS-Bench, a benchmark for real-time evaluation of visual streaming assistant (VSA) models. Unlike prior offline video QA benchmarks, it provides temporally dense annotations (>18k across domains and tasks), defines proactiveness (timeliness of responses) and consistency (robustness over time) metrics, and specifies synchronous and asynchronous evaluation protocols. Large-scale experiments on video and streaming VLMs analyze accuracy-latency trade-offs under factors such as memory buffer length, access policy, and resolution; the central empirical result is that conventional VLMs can be adapted to streaming without retraining and outperform dedicated streaming models (e.g., Qwen3-VL-4B exceeds Dispider by 3% under the asynchronous protocol).

Significance. If the proactiveness/consistency metrics and protocols prove to be valid proxies for real-time assistant utility, the benchmark fills a clear gap in streaming VLM evaluation and supplies actionable design insights. The planned public release of the benchmark and code supports reproducibility and future work in the area.

major comments (2)
  1. [Metrics and Evaluation Protocols (abstract and §4)] The claim that adapted conventional VLMs outperform dedicated streaming models (e.g., the 3% asynchronous gap) is load-bearing and rests entirely on the newly introduced proactiveness and consistency metrics together with the synchronous/asynchronous protocols. The manuscript contains no human correlation study, no ablation demonstrating that these metrics predict downstream task success in live interaction, and no comparison against alternative proxies such as end-to-end user utility; without such evidence the observed differences could be artifacts of metric construction rather than genuine streaming capability.
  2. [Experimental Results (§5)] §5 (experimental results) reports concrete outperformance numbers and accuracy-latency curves but does not provide statistical significance tests, confidence intervals, or details on data splits and annotation quality control for the >18k temporally dense labels. These omissions prevent assessment of whether the reported 3% margin and design-factor insights are robust.
minor comments (2)
  1. [Abstract] The abstract states 'over 18,000 annotations' without an exact count or per-domain/task breakdown; the main text should supply the precise figure and a table summarizing annotation distribution.
  2. [Experimental Setup] Notation for the memory buffer length, access policy, and input resolution parameters is introduced in the experimental section but would benefit from a consolidated table of symbols and default values for reader clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We provide point-by-point responses to the major comments below.

read point-by-point responses
  1. Referee: [Metrics and Evaluation Protocols (abstract and §4)] The claim that adapted conventional VLMs outperform dedicated streaming models (e.g., the 3% asynchronous gap) is load-bearing and rests entirely on the newly introduced proactiveness and consistency metrics together with the synchronous/asynchronous protocols. The manuscript contains no human correlation study, no ablation demonstrating that these metrics predict downstream task success in live interaction, and no comparison against alternative proxies such as end-to-end user utility; without such evidence the observed differences could be artifacts of metric construction rather than genuine streaming capability.

    Authors: We agree that empirical validation of the new metrics against human judgments or downstream utility would provide stronger support for the benchmark's relevance. The proactiveness and consistency metrics are motivated by the fundamental requirements of streaming assistants—responding in a timely manner and maintaining coherent responses over continuous input streams—which are not captured by standard offline accuracy metrics. The synchronous and asynchronous protocols are intended to model different real-world deployment scenarios. However, conducting a full human correlation study or user study is a substantial undertaking that lies beyond the scope of this initial benchmark paper. We will revise the manuscript to include a dedicated limitations section discussing the need for future validation of these metrics and to moderate the language around the outperformance claims, framing them as observations under the proposed evaluation framework rather than definitive proof of superiority. revision: partial

  2. Referee: [Experimental Results (§5)] §5 (experimental results) reports concrete outperformance numbers and accuracy-latency curves but does not provide statistical significance tests, confidence intervals, or details on data splits and annotation quality control for the >18k temporally dense labels. These omissions prevent assessment of whether the reported 3% margin and design-factor insights are robust.

    Authors: We acknowledge these omissions and will address them in the revised manuscript. We will include statistical significance tests (e.g., paired t-tests or bootstrap methods) and confidence intervals for the key performance differences, including the reported 3% margin under the asynchronous protocol. Additionally, we will expand the description of the benchmark construction in Section 3 to provide details on data splits, annotation guidelines, and quality control procedures used for the over 18,000 temporally dense annotations. revision: yes
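For reference, a paired bootstrap of the kind this response mentions is only a few lines: resample videos with replacement and recompute the mean per-video accuracy gap between two models on each resample. The per-video scores below are synthetic placeholders, not VSAS-Bench data.

```python
# Sketch of a paired bootstrap confidence interval for a reported accuracy margin.
# The scores are synthetic placeholders, not results from the paper.
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile CI for mean(scores_a - scores_b) over paired per-video scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

rng = random.Random(1)
model_a = [min(1.0, max(0.0, rng.gauss(0.58, 0.10))) for _ in range(200)]   # placeholder
model_b = [min(1.0, max(0.0, a - rng.gauss(0.03, 0.08))) for a in model_a]  # placeholder gap
low, high = paired_bootstrap_ci(model_a, model_b)
print(f"95% CI for the gap: [{low:.3f}, {high:.3f}]")   # a CI excluding 0 would support significance
```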

standing simulated objections not resolved
  • Conducting a human correlation study or ablation to validate the proactiveness and consistency metrics against real user utility, as this would require additional experiments and resources not feasible within the current revision timeline.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with defined metrics and direct evaluations

full rationale

The paper introduces VSAS-Bench, new metrics (proactiveness, consistency), temporally dense annotations, and synchronous/asynchronous protocols, then reports direct empirical evaluations of existing VLMs on these. No mathematical derivations, fitted parameters repurposed as predictions, self-definitional loops, or load-bearing self-citations appear. The central claim (adapted conventional VLMs outperform dedicated streaming models) is an observation computed from the newly defined metrics applied to model outputs; it does not reduce to a fit or prior self-result by construction. External validation of the metrics (e.g., human correlation) is absent, but that is a validity concern, not circularity per the rules.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark and evaluation framework rather than any mathematical derivation; no free parameters, axioms, or invented entities are required or postulated.

pith-pipeline@v0.9.0 · 5612 in / 1182 out tokens · 62183 ms · 2026-05-10T17:33:00.984037+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 13 canonical work pages · 7 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report...

  2. [2]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–970, 2015.

  3. [3]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models, 2024.

  4. [4]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024.

  5. [5]

    Livecc: Learning video llm with streaming speech transcription at scale

    Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, and Mike Zheng Shou. Livecc: Learning video llm with streaming speech transcription at scale. In CVPR, 2025.

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  7. [7]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  8. [8]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022.

  9. [9]

    Cogvlm2: Visual language models for image and video understanding

    Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.

  10. [10]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  11. [11]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.

  12. [12]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer,

  13. [13]

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024.

  14. [14]

    G-eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, 2023. Association for Computational Linguistics.

  15. [15]

    E.t. bench: Towards open-ended event-level video-language understanding

    Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Chang Wen Chen, and Ying Shan. E.t. bench: Towards open-ended event-level video-language understanding. In Neural Information Processing Systems (NeurIPS), 2024.

  16. [16]

    Nvila: Efficient frontier visual language models, 2024

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. Nvila:...

  17. [17]

    Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023.

  18. [18]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Smolvlm: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299...

  19. [19]

    Ovo-bench: How far is your video-llms from real-world online video understanding?

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025.

  20. [20]

    Gpt-5 system card

    OpenAI. Gpt-5 system card. Technical Report TR-GPT5-2025, OpenAI, 2025. Available at https://cdn.openai.com/gpt-5-system-card.pdf (accessed 2025-11-13).

  21. [21]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima ...

  22. [22]

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025.

  23. [23]

    Coin: A large-scale dataset for comprehensive instructional video analysis

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207–1216, 2019.

  24. [24]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025.

  25. [25]

    Fastvlm: Efficient vision encoding for vision language models

    Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, and Hadi Pouransari. Fastvlm: Efficient vision encoding for vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

  26. [26]

    Streambridge: Turning your offline video large language model into a proactive streaming assistant

    Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. Streambridge: Turning your offline video large language model into a proactive streaming assistant. arXiv preprint arXiv:2505.05467, 2025.

  27. [27]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  28. [28]

    Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

    Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442,

  29. [29]

    Star: A benchmark for situated reasoning in real-world videos

    Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.

  30. [30]

    Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024.

  31. [31]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021.

  32. [32]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.

  33. [33]

    Streamingvlm: Real-time understanding for infinite video streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025.

  34. [34]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.

  35. [35]

    Flash-vstream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams. arXiv preprint arXiv:2406.08085, 2024.

  36. [36]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
