pith. machine review for the scientific record.

arxiv: 2604.24317 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Don't Pause! Every prediction matters in a streaming video

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming video · video question answering · real-time perception · online video models · SPOT-Bench · AsynKV · dead-time

The pith

Streaming video models either spam predictions or become unresponsive when evaluated continuously rather than at fixed pauses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that existing VideoQA benchmarks pause videos at fixed timestamps and score models only at those moments, leaving untested how models behave in always-on streaming settings. It introduces SPOT-Bench, with multi-turn proactive queries, and the Timeliness-F1 metric, which measures temporal precision and coverage across full videos. The benchmark shows that offline models reliably detect events but spam unprompted predictions, that silence training reduces spamming at the cost of unresponsiveness, and that dead-time, where no response is expected, occupies roughly half of each video and can absorb extra compute without increasing response latency. These observations motivate AsynKV, a training-free adaptation that adds a long-short-term memory and scales compute during dead-time. AsynKV retains offline perception accuracy while improving streaming responsiveness, outperforming existing streaming models on SPOT-Bench and achieving state-of-the-art on retrospective benchmarks.
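
To make the evaluation contrast concrete, here is a minimal sketch, assuming a hypothetical per-frame interface (model.step returns a prediction string or None for silence) and a toy semantic check; none of these names or scoring details come from the paper.

```python
# Sketch: pause-based retrospective scoring vs. an always-on streaming protocol.
# Hypothetical interface: model.step(frame) -> str | None (silence).

def semantic_match(pred, answer):
    """Illustrative stand-in for the benchmark's semantic judging."""
    return pred is not None and answer.lower() in pred.lower()

def retrospective_eval(model, frames, pauses):
    """Score only at fixed pause timestamps; pauses maps time -> expected answer."""
    correct = 0
    for t, frame in enumerate(frames):
        pred = model.step(frame)              # predictions between pauses are never scored
        if t in pauses and semantic_match(pred, pauses[t]):
            correct += 1
    return correct / max(len(pauses), 1)

def streaming_eval(model, frames):
    """Collect every non-silent prediction; timing, spam, and silence all count."""
    outputs = []
    for t, frame in enumerate(frames):
        pred = model.step(frame)
        if pred is not None:                  # unprompted outputs are no longer invisible
            outputs.append((t, pred))
    return outputs                            # later matched against ground-truth event slots
```

Under the first loop a model that talks constantly can look as good as one that speaks only when an event occurs; only the second loop exposes spamming and unresponsiveness.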

Core claim

SPOT-Bench reveals that offline models detect events reliably but spam predictions unprompted, that post-training for silence reduces spamming but induces unresponsiveness, and that dead-time, where compute spent does not affect response latency, occupies half the video; AsynKV retains perception while improving streaming behavior through a long-short-term memory whose compute is scaled during dead-time, and achieves SOTA on retrospective benchmarks.

What carries the argument

AsynKV's long-short-term memory, which scales compute only during dead-time so that event perception is maintained without raising response latency in continuous video streams.
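
A minimal sketch of how that could look in practice, assuming a hypothetical wrapper around an offline model; the method names (encode, should_respond, answer, consolidate) and the budget parameter are illustrative assumptions, not AsynKV's actual interface.

```python
from collections import deque

class DeadTimeScheduler:
    """Hypothetical wrapper: spends spare compute only when no response is due."""

    def __init__(self, model, short_len=32, dead_time_budget=4):
        self.model = model
        self.short_term = deque(maxlen=short_len)   # recent frame features, always kept
        self.long_term = []                         # compressed summaries built during dead-time
        self.dead_time_budget = dead_time_budget    # extra consolidation steps per idle frame

    def step(self, frame, query=None):
        self.short_term.append(self.model.encode(frame))         # cheap, runs every frame
        if query is not None or self.model.should_respond(self.short_term):
            # Latency-critical path: answer now, do no extra work here.
            return self.model.answer(query, self.short_term, self.long_term)
        # Dead-time path: no response expected, so fold short-term context
        # into long-term memory with the spare budget.
        for _ in range(self.dead_time_budget):
            summary = self.model.consolidate(self.short_term)
            if summary is None:
                break
            self.long_term.append(summary)
        return None                                               # stay silent
```

The design point is that the consolidation loop only runs on frames where no response is due, so the extra compute never sits on the latency-critical path.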

If this is right

  • Streaming evaluation must test continuous prediction rather than retrospective pauses to expose spamming or unresponsiveness.
  • Dead-time periods allow extra computation without increasing observed latency, enabling more accurate perception at no cost to responsiveness.
  • Training-free memory adaptations of offline models can outperform specialized streaming models on both streaming and retrospective tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world always-on assistants would gain from explicit detection of inactivity periods to allocate compute efficiently.
  • The dead-time observation could apply to other sparse-event streaming domains such as audio or sensor streams.
  • Integrating this memory mechanism with lightweight online updates might further reduce unresponsiveness over long sessions.

Load-bearing premise

The assumption that multi-turn proactive queries in SPOT-Bench and the Timeliness-F1 metric properly capture the requirements for always-on real-time video assistants.

What would settle it

Measuring whether AsynKV's responses in a live unpaused video stream match ground-truth event timings and avoid both false positives and missed detections at rates comparable to the benchmark results.

Figures

Figures reproduced from arXiv: 2604.24317 by Angela Yao, Dibyadip Chatterjee, Fadime Sener, Yale Song, Zhanzhong Pang.

Figure 1: (a) Unsolicited intervention. The assistant monitors the live stream and issues a warning when the user's hand moves toward a dangerous region. (b) Proactive vs. retrospective queries. A retrospective query asks about what already happened, while a proactive query triggers a response at the right moment during the ongoing stream.
Figure 2: Overview of SPOT-Bench tasks. Each example illustrates proactive queries expecting streaming predictions over time, with red boxes marking keyframes. Tasks are grouped into detection, intervention, and interaction benchmarks.
Figure 3: Streaming evaluation protocol. Illustrates the greedy matching procedure used to compute true positives (TP), false positives (FP), and false negatives (FN): TPs are predictions that are both temporally valid and semantically matched to a slot, FPs are unmatched predictions, and FNs are unfilled slots. A prediction with T > 0 that fails the semantic check is counted as an FP rather than a TP.
Figure 4: Performance across multiple turns on a subset of PNR videos.
Figure 6: Our annotation pipeline. We first define a task taxonomy and probe curated datasets for relevant videos. After filtering according to the taxonomy, we generate ground-truth annotations in a semi-automatic process where LLMs assist human annotators.
Figure 7: Additional SPOT-Bench statistics: video durations, turn counts, and the special turn types Referential and Concurrent.
Figure 8: Human preference intervals. Gold intervals derived from a controlled human study with 10 MTurk annotators; blue dots show annotator response times relative to 0 s (Sec. 4.3). For the detection tasks (PNR and ABD), gray dots mark the precise keyframe timestamps provided by the source datasets.
Figure 9: An illustrated example of the Action Boundary Detection (ABD) task. The highlighted "same" is referential, i.e. it points back to the skillet mentioned in the first query.
Figure 10: Illustrated examples of the Point-of-No-Return Detection (PNR) task. The top example shows a multi-turn, single-response case; the bottom example shows a single-turn, multi-response case.
Figure 11: An illustrated example of the Streaming Question Answering (SQA) task. The gap between ask time and response time is typically larger for this task.
Figure 12: An illustrated example of the Streaming Procedural Guidance (SPG) task. A set of allowed action labels is given upfront, and the model must respond with the correct next action at the correct times as the video streams.
Figure 13: An illustrated example of the Solicited Intervention (SI) task. The user provides a task description and a set of allowed actions, and the model issues corrective guidance only when it detects an error in the streaming video.
Figure 14: Illustrated examples of the Unsolicited Intervention (UI) task. The top example shows a multi-response scenario, where the model issues several warnings as new hazards appear; the bottom example shows a single-response case.
original abstract

Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.
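
Reading the abstract together with the Figure 3 caption, a Timeliness-F1-style score could be computed roughly as in the sketch below; the tolerance window, greedy matching order, and semantic check are assumptions rather than the paper's exact protocol.

```python
def semantic_match(pred_text, gt_answer):
    """Illustrative placeholder; the benchmark presumably uses a stronger judge."""
    return gt_answer.lower() in pred_text.lower()

def timeliness_f1(predictions, slots, tol=2.0):
    """predictions: [(time, text)]; slots: [(time, answer)]; tol: seconds of temporal validity."""
    used = set()
    tp = 0
    for p_time, p_text in sorted(predictions):
        best = None
        for i, (s_time, s_answer) in enumerate(slots):
            if i in used or abs(p_time - s_time) > tol:
                continue                      # slot already filled or temporally invalid
            if not semantic_match(p_text, s_answer):
                continue                      # temporally valid but wrong -> stays an FP
            if best is None or abs(p_time - s_time) < abs(p_time - slots[best][0]):
                best = i
        if best is not None:
            used.add(best)
            tp += 1
    fp = len(predictions) - tp                # unmatched predictions (spamming)
    fn = len(slots) - len(used)               # unfilled slots (unresponsiveness)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

In this form, spamming inflates the false-positive count and silence inflates the false-negative count, which is exactly the trade-off the findings above describe.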

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces SPOT-Bench, a new benchmark for streaming video perception that uses multi-turn proactive queries instead of retrospective pauses, paired with the Timeliness-F1 metric that consolidates temporal precision and coverage across entire videos. It reports that offline models reliably detect events but spam unprompted predictions, that post-training for silence induces unresponsiveness, and that roughly half of streaming video is 'dead-time' where no response is expected. These observations motivate AsynKV, a training-free adaptation of offline models that employs long-short-term memory and scales compute during dead-time to improve streaming behavior while preserving perception; AsynKV is presented as a strong baseline on SPOT-Bench and SOTA on retrospective benchmarks.

Significance. If the benchmark and metric are shown to be faithful proxies for always-on real-time assistants, the work would usefully highlight practical gaps in current streaming evaluation and supply a simple, training-free baseline (AsynKV) that future methods could build upon. The explicit identification of dead-time as a compute-inefficiency factor and the provision of a new benchmark with a consolidated metric are concrete contributions that could steer the field toward more temporally aware video models.

major comments (1)
  1. Abstract: the central claims—that offline models spam predictions, silence training causes unresponsiveness, dead-time occupies ~50% of video, and AsynKV improves streaming while retaining perception—rest on SPOT-Bench's multi-turn proactive queries and Timeliness-F1 being valid proxies for always-on assistant requirements. The abstract supplies no evidence of validation via user studies, deployment logs, or correlation with downstream success metrics (e.g., user satisfaction or safety-critical latency), even though that proxy validity is load-bearing for both the diagnostic findings and the claimed superiority of AsynKV.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential utility of SPOT-Bench and AsynKV. We address the major comment point by point below.

point-by-point responses
  1. Referee: Abstract: the central claims—that offline models spam predictions, silence training causes unresponsiveness, dead-time occupies ~50% of video, and AsynKV improves streaming while retaining perception—rest on SPOT-Bench's multi-turn proactive queries and Timeliness-F1 being valid proxies for always-on assistant requirements. The abstract supplies no evidence of validation via user studies, deployment logs, or correlation with downstream success metrics (e.g., user satisfaction or safety-critical latency), even though that proxy validity is load-bearing for both the diagnostic findings and the claimed superiority of AsynKV.

    Authors: We acknowledge that the manuscript does not include direct empirical validation such as user studies, deployment logs, or correlations with downstream metrics like user satisfaction. The SPOT-Bench design and Timeliness-F1 metric are motivated by the logical requirements of always-on streaming assistants—specifically, the need for proactive, timely responses to unfolding events without retrospective pauses or explicit per-step prompts. The multi-turn proactive queries simulate continuous interaction, while Timeliness-F1 penalizes both spamming (false positives) and unresponsiveness (misses) with temporal awareness, directly addressing streaming challenges not captured by retrospective benchmarks. The reported observations (spamming by offline models, unresponsiveness after silence training, ~50% dead-time) and AsynKV's improvements are presented as results obtained under this new protocol, with AsynKV also preserving or improving performance on standard retrospective tasks. We agree this leaves open the question of real-world correlation and view it as a limitation rather than invalidating the benchmark's diagnostic value. We will make a partial revision by clarifying these motivations and limitations in the abstract and adding a brief discussion in the conclusion on the need for future user studies.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces SPOT-Bench and Timeliness-F1 as a new benchmark and metric to address gaps in retrospective VideoQA protocols, with empirical observations on model behaviors (spamming, unresponsiveness, dead-time) motivating the training-free AsynKV adaptation. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the central claims to inputs by construction. The benchmark definition, metric, and method are presented as independent contributions evaluated on both new and retrospective benchmarks, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The review is based only on the abstract; beyond the introduction of the terms dead-time and AsynKV, no explicit free parameters, axioms, or invented entities with independent evidence are detailed.

invented entities (1)
  • dead-time: no independent evidence
    purpose: Describes periods in streaming video where no response is expected from the model
    Introduced in the abstract as occupying half the video and used to motivate efficient compute scaling in AsynKV.

pith-pipeline@v0.9.0 · 5542 in / 1076 out tokens · 46154 ms · 2026-05-08T04:28:14.281473+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 32 canonical work pages · 15 internal anchors
