
arxiv: 2606.17798 · v1 · pith:GAFPNOCBnew · submitted 2026-06-16 · 💻 cs.CV · cs.AI

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Pith reviewed 2026-06-27 01:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords streaming video understanding · video LLMs · hierarchical memory · proactive response · long-horizon streams · online video · response timing · memory management

The pith

LiveStarPro enables proactive understanding of long video streams using verification decoding, causal masks, and hierarchical memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to build a system that processes ongoing video streams in real time, decides on its own when to respond without relying on explicit silence tokens, and keeps context over very long periods without forgetting earlier events. It introduces three parts that work together: an inference step that decides when to answer from the model's perplexity (roughly, how surprised the model is), a training strategy that aligns video and language incrementally, and a memory that stores evicted history as a tree of event chains for fast lookup. This matters for video-based AI assistants that must handle live, extended interactions such as monitoring or commentary. The authors evaluate on OmniStarPro, a new benchmark spanning 15 real-world scenarios with streams up to hours long, and report better semantic correctness and timing along with faster inference.

Core claim

LiveStarPro is designed for proactive video understanding over long-horizon streams through Streaming Verification Decoding that identifies response timing via single-pass perplexity verification, Streaming Causal Attention Masks that enforce incremental alignment over variable-length streams, and Tree-Structured Hierarchical Memory that organizes evicted historical information into event chains for efficient retrieval from unbounded streams, as evaluated on the OmniStarPro benchmark.
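To make the idea of perplexity-based response timing concrete, here is a minimal sketch. It assumes the decision compares the perplexity of the most recent response, rescored against the newest frames, with its perplexity when it was first produced, and speaks when the ratio exceeds a threshold α. The function names, the ratio rule, and the log-probabilities are illustrative assumptions, not the paper's implementation; only the use of perplexity and a default threshold of 1.03 come from the source.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def should_respond(ref_logprobs, new_logprobs, alpha=1.03):
    """Hypothetical rule (an assumption): respond when the previous
    response has become noticeably less likely under the newest frames."""
    return perplexity(new_logprobs) > alpha * perplexity(ref_logprobs)

# Made-up log-probabilities for the same response tokens, scored twice.
ref = [-0.20, -0.10, -0.30]          # when the response was generated
still_valid = [-0.20, -0.11, -0.30]  # scene barely changed
outdated = [-0.90, -0.70, -1.10]     # scene changed a lot

print(should_respond(ref, still_valid))  # stays silent
print(should_respond(ref, outdated))     # triggers a new response
```

Because the rule is a ratio of two perplexities from the same model, it needs only one extra scoring pass per step and no dedicated silence token in the vocabulary.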

What carries the argument

Tree-Structured Hierarchical Memory (TSHM), a recursive architecture that turns evicted history into event chains to allow retrieval from effectively unbounded video streams.
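The following sketch illustrates one way such an event tree could work, based on the Figure 4 caption: a new unit is attached as a child of a similar event when similarity is at least τ, otherwise it starts a new branch, and retrieval returns a chain of events. The class names, cosine similarity, greedy descent, and τ = 0.8 are all assumptions made for illustration; the paper's actual data structure may differ.

```python
import math
from dataclasses import dataclass, field

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class EventNode:  # hypothetical node type
    embedding: list
    label: str
    children: list = field(default_factory=list)

class EventTree:  # hypothetical container, not the paper's API
    def __init__(self, tau=0.8):
        self.tau = tau
        self.roots = []

    def insert(self, embedding, label):
        """Descend while the best match has similarity >= tau; otherwise branch."""
        node, siblings = EventNode(embedding, label), self.roots
        while True:
            best = max(siblings, key=lambda n: cosine(n.embedding, embedding), default=None)
            if best is None or cosine(best.embedding, embedding) < self.tau:
                siblings.append(node)
                return node
            siblings = best.children

    def retrieve(self, query):
        """Follow the most similar node at each level and return the event chain."""
        chain, siblings = [], self.roots
        while siblings:
            best = max(siblings, key=lambda n: cosine(n.embedding, query))
            chain.append(best.label)
            siblings = best.children
        return chain

tree = EventTree(tau=0.8)
tree.insert([1.0, 0.0], "goal scored")
tree.insert([0.9, 0.1], "goal replay")  # similar, so it becomes a child
tree.insert([0.0, 1.0], "crowd shot")   # dissimilar, so it starts a new branch
print(tree.retrieve([1.0, 0.05]))
```

Retrieval in this sketch touches one path from a root to a leaf, so its cost depends on tree depth and branching rather than on the total number of stored events.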

If this is right

  • The model can determine appropriate response times through perplexity verification in a single pass without needing special silence tokens.
  • Training with causal attention masks ensures proper video-language alignment even as streams vary in length.
  • Historical information is organized into retrievable event chains, supporting memory over hour-scale streams.
  • The streaming key-value cache provides a 1.58 times speedup in inference compared to the model without it.
  • Performance gains include 28.9 percent better semantic correctness and 18.2 percent less timing error on the benchmark.
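As an illustration of the causal-mask point above, here is a small sketch of a streaming attention mask over an interleaved frame/caption sequence. It follows the Figure 3 caption, which says preceding captions within each semantic clip are masked. The token layout `(kind, clip_id, step)` and the exact visibility rule are assumptions for illustration, not the paper's specification.

```python
def streaming_causal_mask(tokens):
    """tokens: list of (kind, clip_id, step), kind in {"frame", "caption"}.

    Returns mask[i][j] == True when position i may attend to position j.
    Assumed rule: standard causal attention, except that captions written
    at an earlier step of the same clip are hidden from later positions.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (_, clip_i, step_i) in enumerate(tokens):
        for j in range(i + 1):  # causal: never look at future positions
            kind_j, clip_j, step_j = tokens[j]
            hidden = kind_j == "caption" and clip_j == clip_i and step_j < step_i
            mask[i][j] = not hidden
    return mask

# Two steps in clip 0, then one step in clip 1 (made-up layout).
tokens = [
    ("frame", 0, 0), ("caption", 0, 0),
    ("frame", 0, 1), ("caption", 0, 1),
    ("frame", 1, 2), ("caption", 1, 2),
]
mask = streaming_causal_mask(tokens)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this layout the caption at step 1 cannot see the step-0 caption of its own clip, while positions in clip 1 can still see captions from clip 0, which mimics a model that has not yet spoken within the current clip.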

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This design could be adapted for other continuous data streams such as audio or text conversations.
  • Applications in live event analysis or security monitoring might see direct benefits from the proactive timing and long memory.
  • Future tests could combine this memory structure with other large model architectures to check if the gains hold.
  • The benchmark's coverage of diverse scenarios suggests it could serve as a standard for evaluating online video models.

Load-bearing premise

The three components work together to fix timing, alignment, and memory problems at the same time, and the new benchmark accurately represents real long video streams.

What would settle it

A run on continuous live video feeds in which the system shows no reduction in timing error, or loses semantic accuracy over time, relative to existing approaches would count against the claim.

Figures

Figures reproduced from arXiv: 2606.17798 by Bing Wang, Changsheng Xu, Kairui Zhang, Shengsheng Qian, Zhenyu Yang.

Figure 1: Illustration of online video understanding. (a) Taking the RNG task as an example, online video understanding requires Video-LLMs to continuously process unbounded video streams and respond only at appropriate moments. (b) Existing EOS-based methods suffer from data imbalance and temporal inconsistency, leading to unstable training and suboptimal online inference. (c)-(e) LiveStarPro establishes an effecti…
Figure 2: Overview of the streaming verification decoding (SVeD) inference framework: A dynamic response-silence decoding framework designed to determine optimal response timing for online video understanding.
… alone and overlook tasks like continuous narration or real-time grounding. Their scenario coverage is also narrow, since heavy reliance on Ego4D [28] restricts evaluation primarily to first-person perspectives…
Figure 3: Overview of Streaming Causal Attention Masks (SCAM). SCAM organizes frames and captions into interleaved sequences and performs progressive per-time-step training, masking preceding captions within each semantic clip to align training with streaming inference.
… 1) Streaming Video-Language Alignment: Existing Video-LLMs generally build upon foundation models pre-trained on static image-text pairs [1], [2], […
Figure 4: Overview of Tree-Structured Hierarchical Memory (TSHM). (a) Short-term frames are compressed via Peak-End rule, with evicted units offloaded to long-term storage. (b) The Recursive Event Tree organizes units by attaching similar events as children (Sim ≥ τ) or creating new branches. (c) Context-aware retrieval fetches relevant event chains to augment generation.
… distills the active context window, designat…
Figure 5: Overview of the pipeline of a rigorous multi-stage process. Steps (1)-(3) involve data collection and preprocessing, and steps (4)-(6) involve constructing an online task dataset, using the OmniStarPro-RNG task as an example. Other online tasks are constructed in a similar manner.
… recognition system under a strict lexical density constraint of at most two words per 10-second interval, since the evaluated V…
Figure 6: Distributions of video data. (a) Distribution of video categories across 15 real-world scenarios. (b) Duration distribution of the OmniStarPro-Live partition at the second level. (c) Duration distribution of the OmniStarPro-Long partition at the minute level.
… InternLM2.5-7B [93] language model. InternViT extracts video frame embeddings at 1-4 FPS, with each frame represented by 16 tokens. For efficiency, …
Figure 7: Ablation study on the impact of response-silence threshold.
… is achieved within a narrow interval of α = 1.02–1.04, and we select α = 1.03 as the default setting. The narrowness of this optimal range reflects a well-understood property of perplexity-based thresholds: because perplexity is computed relative to the LM’s own probability distribution, its absolute scale varies across model families and domains,…
Figure 8: Comparison on the RNG task. LiveStarPro is timely and precise, while VideoLLM-online is repetitive and MMDuet often misses key points.
… hour-long streams. On the OmniStarPro-Long partition, LiveStarPro further sustains reliable recall across all three memory-centric tasks (long-range memory recall, cross-event difference query, and temporal backtracking), confirming that TSHM effectively mitigates catastro…
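The Figure 4 caption mentions that short-term frames are compressed via the Peak-End rule before evicted units move to long-term storage. A minimal sketch of that idea follows, assuming each frame carries a salience score and that the window keeps its highest-scoring frame plus its final frame. The scores, the tuple layout, and the function name are hypothetical; the source does not say how salience is computed.

```python
def peak_end_compress(frames):
    """frames: list of (frame_id, salience) in temporal order.

    Keeps the peak (highest salience) and the end (last) frame, and
    returns (kept, evicted). Salience values are assumed inputs.
    """
    if len(frames) <= 2:
        return list(frames), []
    peak_idx = max(range(len(frames)), key=lambda i: frames[i][1])
    keep = sorted({peak_idx, len(frames) - 1})  # a set handles peak == end
    kept = [frames[i] for i in keep]
    evicted = [f for i, f in enumerate(frames) if i not in keep]
    return kept, evicted

window = [("f0", 0.1), ("f1", 0.9), ("f2", 0.3), ("f3", 0.2)]
kept, evicted = peak_end_compress(window)
print(kept)     # peak frame and final frame
print(evicted)  # candidates for long-term storage
```

In this sketch the evicted frames are the units that a long-term memory would then organize, so compression and eviction happen in the same step.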
Original abstract

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces LiveStarPro, a proactive streaming video understanding system for long-horizon streams. It consists of three components: Streaming Verification Decoding (SVeD) to determine response timing via single-pass perplexity verification without silence tokens; Streaming Causal Attention Masks (SCAM) as a training strategy for incremental video-language alignment; and Tree-Structured Hierarchical Memory (TSHM) to recursively organize evicted frames into event chains for efficient retrieval. The work also presents the OmniStarPro benchmark spanning 15 scenarios and hour-scale streams. Experiments report 28.9% gains in semantic correctness, 18.2% reduction in timing error, and 1.58x speedup from the streaming KV cache, with code and model released publicly.

Significance. If the empirical gains and long-horizon claims hold under rigorous evaluation, the work would address key open problems in online Video-LLMs (autonomous response timing, incremental alignment, and scalable memory) with a concrete system and benchmark. The public code release is a clear strength for reproducibility and follow-on work.

major comments (1)
  1. [Abstract (TSHM description)] The central claim that TSHM enables 'efficient retrieval from effectively unbounded video streams' is load-bearing for the long-horizon results (28.9% semantic gain), yet no scaling analysis, query complexity bounds, tree depth, eviction policy, or interaction with the streaming KV cache is provided. Without these, it remains possible that retrieval cost grows linearly with stream length on the hour-scale OmniStarPro streams, undermining the 'proactive streaming over long-horizon' contribution.
minor comments (1)
  1. [Abstract] The abstract states that 'the model and the code are publicly available at https://github.com/sotayang/LiveStarPro' but provides no commit hash, license, or reproducibility checklist; this should be expanded in the final version.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the TSHM description. We address the point below and will strengthen the manuscript with additional analysis.

point-by-point responses
  1. Referee: [Abstract (TSHM description)] The central claim that TSHM enables 'efficient retrieval from effectively unbounded video streams' is load-bearing for the long-horizon results (28.9% semantic gain), yet no scaling analysis, query complexity bounds, tree depth, eviction policy, or interaction with the streaming KV cache is provided. Without these, it remains possible that retrieval cost grows linearly with stream length on the hour-scale OmniStarPro streams, undermining the 'proactive streaming over long-horizon' contribution.

    Authors: We agree that the current version lacks explicit scaling analysis, complexity bounds, tree depth characterization, eviction policy details, and KV-cache interaction for TSHM. This is a valid observation. In the revision we will insert a dedicated subsection (likely in Section 3.3 or 4) that (i) derives the O(log N) query complexity arising from the recursive event-chain tree, (ii) reports observed tree depths on the hour-scale OmniStarPro streams (typically 4–6 levels), (iii) specifies the eviction policy based on event-chain importance scores, and (iv) explains how TSHM retrieval is fused with the streaming KV cache to avoid linear cost. These additions will directly support the long-horizon claims. revision: yes
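To put the O(log N) claim in the simulated rebuttal in perspective, the sketch below computes the depth of an idealized balanced tree and compares a greedy descent with a linear scan. It assumes a fixed branching factor and one leaf per stored unit, which a similarity-driven event tree need not satisfy. The figure of 3600 leaves is simply one hour at 1 FPS (the low end of the 1-4 FPS range quoted in the source), and the function names are hypothetical.

```python
def balanced_depth(n_leaves, branching):
    """Smallest depth d such that branching ** d >= n_leaves."""
    if n_leaves < 1 or branching < 2:
        raise ValueError("need n_leaves >= 1 and branching >= 2")
    depth, capacity = 0, 1
    while capacity < n_leaves:
        capacity *= branching
        depth += 1
    return depth

def descent_comparisons(n_leaves, branching):
    """Similarity comparisons for one greedy root-to-leaf descent."""
    return balanced_depth(n_leaves, branching) * branching

n = 3600  # assumption: one hour at 1 FPS, one leaf per frame
for b in (4, 8):
    print(b, balanced_depth(n, b), descent_comparisons(n, b), "vs linear scan:", n)
```

Under these idealized assumptions a descent needs a few dozen comparisons rather than thousands. Whether the real tree stays balanced enough for this to hold is exactly what the requested scaling analysis would need to show.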

Circularity Check

0 steps flagged

No circularity: descriptive system architecture with no derivations or self-referential fits

full rationale

The provided abstract and description contain no equations, derivations, fitted parameters presented as predictions, or self-citation chains that bear the central claims. The three components (SVeD, SCAM, TSHM) are introduced as design choices whose performance is evaluated empirically on the OmniStarPro benchmark; the reported gains (28.9% semantic correctness, 18.2% timing error reduction, 1.58x speedup) are external measurements rather than quantities forced by construction from the inputs. No load-bearing step reduces to a self-definition or renamed known result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information is provided on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5833 in / 1277 out tokens · 53832 ms · 2026-06-27T01:10:02.028180+00:00 · methodology


Reference graph

Works this paper leans on

100 extracted references · 25 linked inside Pith

  1. [1]

    Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,

    Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024

  2. [2]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

  3. [3]

    Minicpm-v: A gpt-4v level mllm on your phone,

    Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He et al., “Minicpm-v: A gpt-4v level mllm on your phone,” arXiv preprint arXiv:2408.01800, 2024

  4. [4]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output,

    P. Zhang, X. Dong, Y. Zang, Y. Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyang et al., “Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output,” arXiv preprint arXiv:2407.03320, 2024

  5. [5]

    Chatglm: A family of large language models from glm-130b to glm-4 all tools,

    T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao et al., “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” arXiv preprint arXiv:2406.12793, 2024

  6. [6]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,

    K. Ataallah, X. Shen, E. Abdelrahman, E. Sleiman, D. Zhu, J. Ding, and M. Elhoseiny, “Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,” arXiv preprint arXiv:2404.03413, 2024

  7. [7]

    Video-chatgpt: Towards detailed video understanding via large vision and language models,

    M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” arXiv preprint arXiv:2306.05424, 2023

  8. [8]

    Videochat: Chat-centric video understanding,

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

  9. [9]

    Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,

    A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 714–10 726

  10. [10]

    Internvideo: General video foundation models via generative and discriminative learning,

    Y. Wang, K. Li, Y. Li, Y. He, B. Huang, Z. Zhao, H. Zhang, J. Xu, Y. Liu, Z. Wang et al., “Internvideo: General video foundation models via generative and discriminative learning,” arXiv preprint arXiv:2212.03191, 2022

  11. [11]

    Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,

    Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhao et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

  12. [12]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,

    Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,” arXiv preprint arXiv:2409.12961, 2024

  13. [13]

    Timechat: A time-sensitive multimodal large language model for long video understanding,

    S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 313–14 323

  14. [14]

    Flash-vstream: Memory-based real-time understanding for long video streams,

    H. Zhang, Y. Wang, Y. Tang, Y. Liu, J. Feng, J. Dai, and X. Jin, “Flash-vstream: Memory-based real-time understanding for long video streams,” arXiv preprint arXiv:2406.08085, 2024

  15. [15]

    Moviechat+: Question-aware sparse memory for long video question answering,

    E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” arXiv preprint arXiv:2404.17176, 2024

  16. [16]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

    B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 504–13 514

  17. [17]

    Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,

    X. Wang, D. Song, S. Chen, C. Zhang, and B. Wang, “Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,” [Online]. Available: https://arxiv.org/abs/2409.02889

  19. [19]

    Longvila: Scaling long-context visual language models for long videos,

    F. Xue, Y. Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y. Fang, H. Tang, S. Yang, Z. Liu, Y. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y. Zhu, Y. Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,” 2024

  20. [20]

    Long context transfer from language to vision,

    P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,” arXiv preprint arXiv:2406.16852, 2024

  21. [21]

    Videollm-online: Online video large language model for streaming video,

    J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou, “Videollm-online: Online video large language model for streaming video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 407–18 418

  22. [22]

    Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation,

    S. Wu, J. Chen, K. Q. Lin, Q. Wang, Y. Gao, Q. Xu, T. Xu, Y. Hu, E. Chen, and M. Z. Shou, “Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation,” Advances in Neural Information Processing Systems, vol. 37, pp. 109 922–109 947, 2024

  23. [23]

    Lion-fs: Fast & slow video-language thinker as online video assistant,

    W. Li, B. Hu, R. Shao, L. Shen, and L. Nie, “Lion-fs: Fast & slow video-language thinker as online video assistant,” arXiv preprint arXiv:2503.03663, 2025

  24. [24]

    Streammind: Unlocking full frame rate streaming video dialogue through event-gated cognition,

    X. Ding, H. Wu, Y. Yang, S. Jiang, D. Bai, Z. Chen, and T. Cao, “Streammind: Unlocking full frame rate streaming video dialogue through event-gated cognition,” arXiv preprint arXiv:2503.06220, 2025

  25. [25]

    Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format,

    Y. Wang, X. Meng, Y. Wang, J. Liang, J. Wei, H. Zhang, and D. Zhao, “Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format,” arXiv preprint arXiv:2411.17991, 2024

  26. [26]

    Streaming long video understanding with large language models,

    R. Qian, X. Dong, P. Zhang, Y. Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 119 336–119 360, 2024

  27. [27]

    Streaming dense video captioning,

    X. Zhou, A. Arnab, S. Buch, S. Yan, A. Myers, X. Xiong, A. Nagrani, and C. Schmid, “Streaming dense video captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 243–18 252

  28. [28]

    Streaming video understanding and multi-round interaction with memory-enhanced knowledge,

    H. Xiong, Z. Yang, J. Yu, Y. Zhuge, L. Zhang, J. Zhu, and H. Lu, “Streaming video understanding and multi-round interaction with memory-enhanced knowledge,” arXiv preprint arXiv:2501.13468, 2025

  29. [29]

    Ego4d: Around the world in 3,000 hours of egocentric video,

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., “Ego4d: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 995–19 012

  30. [30]

    Soccernet: A scalable dataset for action spotting in soccer videos,

    S. Giancola, M. Amine, T. Dghaily, and B. Ghanem, “Soccernet: A scalable dataset for action spotting in soccer videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 1711–1721

  31. [31]

    Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding,

    Z. Yang, Y. Hu, Z. Du, D. Xue, S. Qian, J. Wu, F. Yang, W. Dong, and C. Xu, “Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding,” arXiv preprint arXiv:2502.10810, 2025

  32. [32]

    Ovo-bench: How far is your video-llms from real-world online video understanding?

    Y. Li, J. Niu, Z. Miao, C. Ge, Y. Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian et al., “Ovo-bench: How far is your video-llms from real-world online video understanding?” arXiv preprint arXiv:2501.05510, 2025

  33. [33]

    Livestar: Live streaming assistant for real-world online video understanding,

    Z. Yang, K. Zhang, Y. Hu, B. Wang, S. Qian, B. Wen, F. Yang, T. Gao, W. Dong, and C. Xu, “Livestar: Live streaming assistant for real-world online video understanding,” Advances in Neural Information Processing Systems, vol. 38, pp. 31 266–31 304, 2026

  34. [34]

    Llama 2: Open foundation and fine-tuned chat models,

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  35. [35]

    Gemini: a family of highly capable multimodal models,

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  36. [36]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  37. [37]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  38. [38]

    Improving language understanding by generative pre-training,

    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018

  39. [39]

    Video dataflywheel: Resolving the impossible data trinity in video-language understanding,

    X. Wang, J. Wu, Z. Lin, F. Zhang, D. Zhang, and L. Nie, “Video dataflywheel: Resolving the impossible data trinity in video-language understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  40. [40]

    Object-centric representation learning for video scene understanding,

    Y. Zhou, H. Zhang, S.-I. Park, B. Yoo, and X. Qi, “Object-centric representation learning for video scene understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 8410–8423, 2024

  41. [41]

    Sharegpt4video: Improving video understanding and generation with better captions,

    L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, Z. Tang, L. Yuan et al., “Sharegpt4video: Improving video understanding and generation with better captions,” Advances in Neural Information Processing Systems, vol. 37, pp. 19 472–19 495, 2024

  42. [42]

    Pllava: Parameter-free llava extension from images to videos for video dense captioning,

    L. Xu, Y. Zhao, D. Zhou, Z. Lin, S. K. Ng, and J. Feng, “Pllava: Parameter-free llava extension from images to videos for video dense captioning,” arXiv preprint arXiv:2404.16994, 2024

  43. [43]

    Video recap: Recursive captioning of hour-long videos,

    M. M. Islam, N. Ho, X. Yang, T. Nagarajan, L. Torresani, and G. Bertasius, “Video recap: Recursive captioning of hour-long videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 198–18 208

  44. [44]

    Large language models are temporal and causal reasoners for video question answering,

    D. Ko, J. S. Lee, W. Kang, B. Roh, and H. J. Kim, “Large language models are temporal and causal reasoners for video question answering,” arXiv preprint arXiv:2310.15747, 2023

  45. [45]

    Mvbench: A comprehensive multi-modal video understanding benchmark,

    K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo et al., “Mvbench: A comprehensive multi-modal video understanding benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 195–22 206

  46. [46]

    Videogpt+: Integrating image and video encoders for enhanced video understanding,

    M. Maaz, H. Rasheed, S. Khan, and F. Khan, “Videogpt+: Integrating image and video encoders for enhanced video understanding,” arXiv preprint arXiv:2406.09418, 2024

  47. [47]

    Learning to answer visual questions from web videos,

    A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Learning to answer visual questions from web videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3202–3218, 2025

  48. [48]

    Transformer-empowered invariant grounding for video question answering,

    Y. Li, X. Wang, J. Xiao, W. Ji, and T.-S. Chua, “Transformer-empowered invariant grounding for video question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 9510–9522, 2025

  49. [49]

    Intentqa: Intent question answering in videos by cognitive context reasoning,

    J. Li, P. Wei, W. Han, S.-C. Zhu, and L. Fan, “Intentqa: Intent question answering in videos by cognitive context reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–18, 2026

  50. [50]

    Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,

    Y. Guo, J. Liu, M. Li, D. Cheng, X. Tang, D. Sui, Q. Liu, X. Chen, and K. Zhao, “Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3302–3310

  51. [51]

    Vtg-gpt: Tuning-free zero-shot video temporal grounding with gpt,

    Y. Xu, Y. Sun, Z. Xie, B. Zhai, and S. Du, “Vtg-gpt: Tuning-free zero-shot video temporal grounding with gpt,” Applied Sciences, vol. 14, no. 5, p. 1894, 2024

  52. [52]

    Hawkeye: Training video-text llms for grounding text in videos,

    Y. Wang, X. Meng, J. Liang, Y. Wang, Q. Liu, and D. Zhao, “Hawkeye: Training video-text llms for grounding text in videos,” arXiv preprint arXiv:2403.10228, 2024

  53. [53]

    Llava-next: A strong zero-shot video understanding model,

    Y. Zhang, B. Li, H. Liu, Y. J. Lee, L. Gui, D. Fu, J. Feng, Z. Liu, and C. Li, “Llava-next: A strong zero-shot video understanding model,” April 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-04-30-llava-next-video/

  54. [54]

    Video-llava: Learning united visual representation by alignment before projection,

    B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,” arXiv preprint arXiv:2311.10122, 2023

  55. [55]

    Vila: On pre-training for visual language models,

    J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 689–26 699

  56. [56]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,

    M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

  57. [57]

    Valor: Vision-audio-language omni-perception pretraining model and dataset,

    J. Liu, S. Chen, X. He, L. Guo, X. Zhu, W. Wang, and J. Tang, “Valor: Vision-audio-language omni-perception pretraining model and dataset,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 2, pp. 708–724, 2025

  58. [58]

    Cap4video++: Enhancing video understanding with auxiliary captions,

    W. Wu, X. Wang, H. Luo, J. Wang, Y. Yang, and W. Ouyang, “Cap4video++: Enhancing video understanding with auxiliary captions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5223–5237, 2025

  59. [59]

    Hierarchical banzhaf interaction for general video-language representation learning,

    P. Jin, H. Li, L. Yuan, S. Yan, and J. Chen, “Hierarchical banzhaf interaction for general video-language representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2125–2139, 2025

  60. [60]

    Video dataflywheel: Resolving the impossible data trinity in video-language understanding,

    X. Wang, J. Wu, Z. Lin, F. Zhang, D. Zhang, and L. Nie, “Video dataflywheel: Resolving the impossible data trinity in video-language understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2912–2923, 2025

  61. [61]

    Livechat: A large-scale personalized dialogue dataset automatically constructed from live streaming,

    J. Gao, Y. Lian, Z. Zhou, Y. Fu, and B. Wang, “Livechat: A large-scale personalized dialogue dataset automatically constructed from live streaming,” arXiv preprint arXiv:2306.08401, 2023

  62. [62]

    Don’t pause: Streaming video-language synchrony for online video understanding,

    Z. Yang, K. Zhang, S. Qian, W. Dong, and C. Xu, “Don’t pause: Streaming video-language synchrony for online video understanding,” arXiv preprint arXiv:2606.06991, 2026

  63. [63]

    Querystream: Advancing streaming video understanding with query-aware pruning and proactive response,

    K. Zhang, Z. Yang, B. Wang, S. Qian, and C. Xu, “Querystream: Advancing streaming video understanding with query-aware pruning and proactive response,” in The Fourteenth International Conference on Learning Representations, 2026

  64. [64]

    Languagebind: Extending video-language pre-training to n-modality by language-based semantic alignment,

    B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, H. Wang, Y. Pang, W. Jiang, J. Zhang, Z. Li et al., “Languagebind: Extending video-language pre-training to n-modality by language-based semantic alignment,” arXiv preprint arXiv:2310.01852, 2023

  65. [65]

    Egoschema: A diagnostic benchmark for very long-form video language understanding,

    K. Mangalam, R. Akshulakov, J. Malik et al., “Egoschema: A diagnostic benchmark for very long-form video language understanding,” Advances in Neural Information Processing Systems, vol. 36, 2024

  66. [66]

    Activitynet-qa: A dataset for understanding complex web videos via question answering,

    Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao, “Activitynet-qa: A dataset for understanding complex web videos via question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9127–9134

  67. [67]

    Hero: Hierarchical encoder for video+language omni-representation pre-training,

    L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, “Hero: Hierarchical encoder for video+language omni-representation pre-training,” arXiv preprint arXiv:2005.00200, 2020

  68. [68]

    Perception test: A diagnostic benchmark for multimodal video models,

    V. Patraucean, L. Smaira, A. Gupta, A. Recasens, L. Markeeva, D. Banarse, S. Koppula, M. Malinowski, Y. Yang, C. Doersch et al., “Perception test: A diagnostic benchmark for multimodal video models,” Advances in Neural Information Processing Systems, vol. 36, 2024

  69. [69]

    Social-iq: A question answering benchmark for artificial social intelligence,

    A. Zadeh, M. Chan, P. P. Liang, E. Tong, and L.-P. Morency, “Social-iq: A question answering benchmark for artificial social intelligence,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8807–8817

  70. [70]

    Video question answering via gradually refined attention over appearance and motion,

    D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1645–1653

  71. [71]

    Tvqa: Localized, compositional video question answering,

    J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, compositional video question answering,” arXiv preprint arXiv:1809.01696, 2018

  72. [72]

    Next-qa: Next phase of question-answering to explaining temporal actions,

    J. Xiao, X. Shang, A. Yao, and T.-S. Chua, “Next-qa: Next phase of question-answering to explaining temporal actions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9777–9786

  73. [73]

    Moviechat: From dense token to sparse memory for long video understanding,

    E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y. Zhang et al., “Moviechat: From dense token to sparse memory for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 221–18 232

  74. [74]

    Lvbench: An extreme long video understanding benchmark,

    W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, S. Huang, B. Xu, Y. Dong, M. Ding et al., “Lvbench: An extreme long video understanding benchmark,” arXiv preprint arXiv:2406.08035, 2024

  75. [75]

    Tgif-qa: Toward spatio-temporal reasoning in visual question answering,

    Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2758–2766

  76. [76]

    Moviechat+: Question-aware sparse memory for long video question answering,

    E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 374–389, 2026

  77. [77]

    Momentor++: Advancing video large language models with fine-grained long video reasoning,

    J. Li, M. Gao, X. He, S. Tang, W.-S. Zheng, J. Xiao, M. Wang, T.-S. Chua, and Y. Zhuang, “Momentor++: Advancing video large language models with fine-grained long video reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6208–6224, 2026

  78. [78]

    Selongvlm: Empowering long video language models with self-corrective clip selection,

    K. Zhang, Z. Yang, M. Han, Y. Zhuge, H. Hao, C. Li, Z. Li, and X. Chang, “Selongvlm: Empowering long video language models with self-corrective clip selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, 2026

  79. [79]

    Ego-r1: Agentic chain-of-tool-thought for ultra-long egocentric video reasoning,

    S. Tian, R. Wang, H. Guo, P. Wu, Y. Dong, X. Wang, J. Yang, H. Zhang, H. Zhu, and Z. Liu, “Ego-r1: Agentic chain-of-tool-thought for ultra-long egocentric video reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, 2026

  80. [80]

    Hier-egopack: Hierarchical egocentric video understanding with diverse task perspectives,

    S. A. Peirone, F. Pistilli, A. Alliegro, T. Tommasi, and G. Averta, “Hier-egopack: Hierarchical egocentric video understanding with diverse task perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1917–1931, 2026

Showing first 80 references.