LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Bing Wang; Changsheng Xu; Kairui Zhang; Shengsheng Qian; Zhenyu Yang

arxiv: 2606.17798 · v1 · pith:GAFPNOCBnew · submitted 2026-06-16 · 💻 cs.CV · cs.AI

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

Pith reviewed 2026-06-27 01:10 UTC · model grok-4.3

keywords: streaming video understanding, video LLMs, hierarchical memory, proactive response, long-horizon streams, online video, response timing, memory management

The pith

LiveStarPro enables proactive understanding of long video streams using verification decoding, causal masks, and hierarchical memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to build a system that processes ongoing video streams in real time, decides on its own when to respond rather than relying on explicit silence tokens, keeps context over very long periods, and avoids forgetting earlier events. It introduces three cooperating parts: one that decides when to answer by checking the model's perplexity in a single pass, one that trains the model to align video and language incrementally, and one that stores evicted history in a tree of events for quick lookup. This matters for video-based AI assistants that must handle live, extended interactions such as monitoring or commentary. The authors evaluate on a new benchmark of 15 real-world scenarios with streams up to hour scale and report better semantic correctness and timing accuracy along with faster inference.

Core claim

LiveStarPro is designed for proactive video understanding over long-horizon streams through Streaming Verification Decoding that identifies response timing via single-pass perplexity verification, Streaming Causal Attention Masks that enforce incremental alignment over variable-length streams, and Tree-Structured Hierarchical Memory that organizes evicted historical information into event chains for efficient retrieval from unbounded streams, as evaluated on the OmniStarPro benchmark.

What carries the argument

Tree-Structured Hierarchical Memory (TSHM), a recursive architecture that turns evicted history into event chains to allow retrieval from effectively unbounded video streams.

If this is right

The model can determine appropriate response times through perplexity verification in a single pass without needing special silence tokens.
Training with causal attention masks ensures proper video-language alignment even as streams vary in length.
Historical information is organized into retrievable event chains, supporting memory over hour-scale streams.
The streaming key-value cache provides a 1.58 times speedup in inference compared to the model without it.
Performance gains include 28.9 percent better semantic correctness and 18.2 percent less timing error on the benchmark.
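The first item in the list above can be illustrated with a minimal Python sketch of a perplexity-based response gate. This is not the authors' implementation: the decision rule shown here (respond when the perplexity of the previous response, re-scored under the newest frames, exceeds `alpha` times the perplexity it had when it was emitted) is an assumption about how a single-pass check could work, and the names `perplexity` and `should_respond` are hypothetical. The default `alpha=1.03` mirrors the value quoted in the Figure 7 text.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def should_respond(
    prev_logprobs_now: Sequence[float],
    prev_logprobs_at_emit: Sequence[float],
    alpha: float = 1.03,
) -> bool:
    """Assumed gate: speak only if the old response has become noticeably
    less likely given the newest frames.

    prev_logprobs_now: per-token log-probs of the previous response,
        re-scored in one forward pass with the current frames in context.
    prev_logprobs_at_emit: the same tokens' log-probs when first generated.
    """
    ppl_now = perplexity(prev_logprobs_now)
    ppl_ref = perplexity(prev_logprobs_at_emit)
    # A ratio test keeps the threshold independent of the absolute
    # perplexity scale of a given model.
    return ppl_now > alpha * ppl_ref
```

Because the gate compares two perplexities from the same model, a threshold close to 1 can be meaningful even though absolute perplexity values differ between model families.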

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This design could be adapted for other continuous data streams such as audio or text conversations.
Applications in live event analysis or security monitoring might see direct benefits from the proactive timing and long memory.
Future tests could combine this memory structure with other large model architectures to check if the gains hold.
The benchmark's coverage of diverse scenarios suggests it could serve as a standard for evaluating online video models.

Load-bearing premise

The three components work together to fix timing, alignment, and memory problems at the same time, and the new benchmark accurately represents real long video streams.

What would settle it

A run on continuous live video feeds in which the system shows no reduction in timing error, or loses semantic accuracy over time, relative to existing approaches would count against the claim.

Figures

Figures reproduced from arXiv: 2606.17798 by Bing Wang, Changsheng Xu, Kairui Zhang, Shengsheng Qian, Zhenyu Yang.

**Figure 1.** Illustration of online video understanding. (a) Taking the RNG task as an example, online video understanding requires Video-LLMs to continuously process unbounded video streams and respond only at appropriate moments. (b) Existing EOS-based methods suffer from data imbalance and temporal inconsistency, leading to unstable training and suboptimal online inference. (c)-(e) LiveStarPro establishes an effecti…

**Figure 2.** Overview of the streaming verification decoding (SVeD) inference framework: A dynamic response-silence decoding framework designed to determine optimal response timing for online video understanding.

… alone and overlook tasks like continuous narration or real-time grounding. Their scenario coverage is also narrow, since heavy reliance on Ego4D [28] restricts evaluation primarily to first-person perspectives…

**Figure 3.** Overview of Streaming Causal Attention Masks (SCAM). SCAM organizes frames and captions into interleaved sequences and performs progressive per-time-step training, masking preceding captions within each semantic clip to align training with streaming inference.

1) Streaming Video-Language Alignment: Existing VideoLLMs generally build upon foundation models pre-trained on static image-text pairs [1], [2], […
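The masking rule in the SCAM caption can be sketched as a boolean attention mask over an interleaved frame/caption sequence. The sketch below is an interpretation, not the paper's code: it assumes one tag per token (`kind`, `clip`, `step`), and it reads "masking preceding captions within each semantic clip" as hiding caption tokens from earlier time steps of the same clip while keeping ordinary causal attention everywhere else. `Token` and `streaming_causal_mask` are hypothetical names.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str  # "frame" or "caption"
    clip: int  # index of the semantic clip the token belongs to
    step: int  # time step of the token within the stream


def streaming_causal_mask(tokens: List[Token]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, query in enumerate(tokens):
        for j in range(i + 1):  # causal: only positions up to i
            key = tokens[j]
            # Assumed rule: hide captions from earlier steps of the same clip.
            hidden = (
                key.kind == "caption"
                and key.clip == query.clip
                and key.step < query.step
            )
            mask[i][j] = not hidden
    return mask


# Two steps in clip 0, then one step in clip 1.
tokens = [
    Token("frame", 0, 0), Token("caption", 0, 0),
    Token("frame", 0, 1), Token("caption", 0, 1),
    Token("frame", 1, 2), Token("caption", 1, 2),
]
mask = streaming_causal_mask(tokens)
```

In this toy sequence the caption at step 1 can see both frames of clip 0 but not the step-0 caption, whereas tokens in clip 1 still see every earlier caption.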

**Figure 4.** Overview of Tree-Structured Hierarchical Memory (TSHM). (a) Short-term frames are compressed via Peak-End rule, with evicted units offloaded to long-term storage. (b) The Recursive Event Tree organizes units by attaching similar events as children (Sim ≥ τ) or creating new branches. (c) Context-aware retrieval fetches relevant event chains to augment generation.

… distills the active context window, designat…
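Panels (b) and (c) of the TSHM caption describe an attach-or-branch rule and chain retrieval, which can be sketched in a few lines of Python. Everything beyond the Sim ≥ τ rule is an assumption made for illustration: cosine similarity as Sim, greedy descent along the most similar child, `tau=0.8`, and the names `EventNode`, `insert`, and `retrieve_chain`.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class EventNode:
    label: str
    emb: Sequence[float]
    children: List["EventNode"] = field(default_factory=list)


def best_child(node: EventNode, emb: Sequence[float]) -> Tuple[Optional[EventNode], float]:
    """Return the child most similar to emb, or (None, -2.0) if there is none."""
    best, best_sim = None, -2.0
    for child in node.children:
        sim = cosine(child.emb, emb)
        if sim > best_sim:
            best, best_sim = child, sim
    return best, best_sim


def insert(root: EventNode, unit: EventNode, tau: float = 0.8) -> None:
    """Descend while Sim >= tau, then attach; otherwise open a new branch."""
    node = root
    while True:
        child, sim = best_child(node, unit.emb)
        if child is None or sim < tau:
            node.children.append(unit)
            return
        node = child


def retrieve_chain(root: EventNode, query: Sequence[float], tau: float = 0.8) -> List[str]:
    """Follow the most similar child at each level and collect the event chain."""
    chain: List[str] = []
    node = root
    while True:
        child, sim = best_child(node, query)
        if child is None or sim < tau:
            return chain
        chain.append(child.label)
        node = child
```

With this greedy scheme, retrieval cost depends on tree depth and branching rather than on the total number of stored events, which is the property the long-horizon claim relies on; whether the real TSHM tree stays shallow on hour-scale streams is an empirical question.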

**Figure 5.** Overview of the pipeline of a rigorous multi-stage process. Steps (1)-(3) involve data collection and preprocessing, and steps (4)-(6) involve constructing an online task dataset, using the OmniStarPro-RNG task as an example. Other online tasks are constructed in a similar manner.

… recognition system under a strict lexical density constraint of at most two words per 10-second interval, since the evaluated V…

**Figure 6.** Distributions of video data. (a) Distribution of video categories across 15 real-world scenarios. (b) Duration distribution of the OmniStarPro-Live partition at the second level. (c) Duration distribution of the OmniStarPro-Long partition at the minute level.

… InternLM2.5-7B [93] language model. InternViT extracts video frame embeddings at 1-4 FPS, with each frame represented by 16 tokens. For efficiency, …
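The frame rate and per-frame token count quoted above give a quick sense of why an hour-scale stream cannot simply be kept in context. The arithmetic below takes one hour (3,600 seconds) as an assumed stream length; `visual_tokens` is a hypothetical helper, and the count covers visual tokens only.

```python
def visual_tokens(duration_s: int, fps: int, tokens_per_frame: int = 16) -> int:
    """Visual tokens produced by a stream sampled at a fixed frame rate."""
    return duration_s * fps * tokens_per_frame


ONE_HOUR_S = 3600  # assumed stream length for illustration

low = visual_tokens(ONE_HOUR_S, fps=1)   # 3600 * 1 * 16 = 57,600
high = visual_tokens(ONE_HOUR_S, fps=4)  # 3600 * 4 * 16 = 230,400
```

Even at the lowest quoted rate, an hour of video yields tens of thousands of visual tokens before any text is added, which motivates evicting and summarizing older content.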

**Figure 7.** Ablation study on the impact of response-silence threshold.

… is achieved within a narrow interval of α = 1.02–1.04, and we select α = 1.03 as the default setting. The narrowness of this optimal range reflects a well-understood property of perplexity-based thresholds: because perplexity is computed relative to the LM’s own probability distribution, its absolute scale varies across model families and domains,…

**Figure 8.** Comparison on the RNG task. LiveStarPro is timely and precise, while VideoLLM-online is repetitive and MMDuet often misses key points.

… hour-long streams. On the OmniStarPro-Long partition, LiveStarPro further sustains reliable recall across all three memory-centric tasks (long-range memory recall, cross-event difference query, and temporal backtracking), confirming that TSHM effectively mitigates catastro…

Original abstract

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LiveStarPro adds three named components for streaming video LLMs and a new long-stream benchmark, but TSHM's unbounded memory claim rests on unshown scaling behavior.

The paper introduces LiveStarPro with SVeD for response timing via single-pass perplexity, SCAM for incremental alignment during training, and TSHM for recursive event-chain memory on evicted frames. It also releases OmniStarPro, a benchmark covering 15 scenarios up to hour-scale streams.

The components address real gaps in current online Video-LLMs: deciding when to answer without silence tokens, maintaining alignment over variable inputs, and trying to retain context beyond fixed windows. Public code and the benchmark are concrete positives that let others test the ideas directly.

The main soft spot is TSHM. The abstract claims efficient retrieval from effectively unbounded streams through recursive organization into event chains, yet it gives no tree depth, query complexity, eviction details, or scaling curves for hour-long inputs. If retrieval cost grows linearly with the number of events rather than staying sub-linear, the central long-horizon advantage does not follow even if SVeD and SCAM work.

The reported 28.9% semantic gain and 18.2% timing improvement are presented against existing methods, but the abstract does not detail ablations or how the three pieces interact under the streaming KV cache. That leaves the combined system harder to evaluate.

This work is aimed at applied researchers building real-time multimodal assistants. A reader focused on streaming video systems would find the benchmark and component descriptions useful. It deserves peer review because it ships runnable code, a new evaluation set, and a clear problem framing, even though the memory scaling needs tighter evidence.

Referee Report

1 major / 1 minor

Summary. The paper introduces LiveStarPro, a proactive streaming video understanding system for long-horizon streams. It consists of three components: Streaming Verification Decoding (SVeD) to determine response timing via single-pass perplexity verification without silence tokens; Streaming Causal Attention Masks (SCAM) as a training strategy for incremental video-language alignment; and Tree-Structured Hierarchical Memory (TSHM) to recursively organize evicted frames into event chains for efficient retrieval. The work also presents the OmniStarPro benchmark spanning 15 scenarios and hour-scale streams. Experiments report 28.9% gains in semantic correctness, 18.2% reduction in timing error, and 1.58x speedup from the streaming KV cache, with code and model released publicly.

Significance. If the empirical gains and long-horizon claims hold under rigorous evaluation, the work would address key open problems in online Video-LLMs (autonomous response timing, incremental alignment, and scalable memory) with a concrete system and benchmark. The public code release is a clear strength for reproducibility and follow-on work.

major comments (1)

[Abstract (TSHM description)] The central claim that TSHM enables 'efficient retrieval from effectively unbounded video streams' is load-bearing for the long-horizon results (28.9% semantic gain), yet no scaling analysis, query complexity bounds, tree depth, eviction policy, or interaction with the streaming KV cache is provided. Without these, it remains possible that retrieval cost grows linearly with stream length on the hour-scale OmniStarPro streams, undermining the 'proactive streaming over long-horizon' contribution.

minor comments (1)

[Abstract] The abstract states that 'the model and the code are publicly available at https://github.com/sotayang/LiveStarPro' but provides no commit hash, license, or reproducibility checklist; this should be expanded in the final version.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the TSHM description. We address the point below and will strengthen the manuscript with additional analysis.

Referee: [Abstract (TSHM description)] The central claim that TSHM enables 'efficient retrieval from effectively unbounded video streams' is load-bearing for the long-horizon results (28.9% semantic gain), yet no scaling analysis, query complexity bounds, tree depth, eviction policy, or interaction with the streaming KV cache is provided. Without these, it remains possible that retrieval cost grows linearly with stream length on the hour-scale OmniStarPro streams, undermining the 'proactive streaming over long-horizon' contribution.

Authors: We agree that the current version lacks explicit scaling analysis, complexity bounds, tree depth characterization, eviction policy details, and KV-cache interaction for TSHM. This is a valid observation. In the revision we will insert a dedicated subsection (likely in Section 3.3 or 4) that (i) derives the O(log N) query complexity arising from the recursive event-chain tree, (ii) reports observed tree depths on the hour-scale OmniStarPro streams (typically 4–6 levels), (iii) specifies the eviction policy based on event-chain importance scores, and (iv) explains how TSHM retrieval is fused with the streaming KV cache to avoid linear cost. These additions will directly support the long-horizon claims.

Revision: yes
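To make the linear versus logarithmic distinction in this exchange concrete, the sketch below counts similarity comparisons for a flat scan and for a greedy tree descent. It is a back-of-the-envelope model, not a result from the paper: it assumes a balanced tree with a fixed branching factor and one comparison per child at each level, and the event count of 4,096 is chosen only for illustration. All function names are hypothetical.

```python
def tree_depth(n_events: int, branching: int) -> int:
    """Smallest depth d such that branching**d >= n_events (balanced tree)."""
    if n_events < 1 or branching < 2:
        raise ValueError("need n_events >= 1 and branching >= 2")
    depth, capacity = 0, 1
    while capacity < n_events:
        capacity *= branching
        depth += 1
    return depth


def descent_comparisons(n_events: int, branching: int) -> int:
    """Greedy descent compares the query with every child at each level."""
    return branching * tree_depth(n_events, branching)


def flat_scan_comparisons(n_events: int) -> int:
    """A flat memory compares the query with every stored event."""
    return n_events


n = 4096  # assumed number of stored events
depth_b4 = tree_depth(n, 4)           # 4**6 == 4096, so depth 6
depth_b8 = tree_depth(n, 8)           # 8**4 == 4096, so depth 4
cost_b4 = descent_comparisons(n, 4)   # 4 * 6 == 24
cost_flat = flat_scan_comparisons(n)  # 4096
```

Under these assumptions the descent needs a few dozen comparisons where a flat scan needs thousands. An unbalanced tree, for example one long chain of near-duplicate events, would lose this advantage, which is why measured depths and branching statistics matter for the claim.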

Circularity Check

0 steps flagged

No circularity: descriptive system architecture with no derivations or self-referential fits

The provided abstract and description contain no equations, derivations, fitted parameters presented as predictions, or self-citation chains that bear the central claims. The three components (SVeD, SCAM, TSHM) are introduced as design choices whose performance is evaluated empirically on the OmniStarPro benchmark; the reported gains (28.9% semantic correctness, 18.2% timing error reduction, 1.58x speedup) are external measurements rather than quantities forced by construction from the inputs. No load-bearing step reduces to a self-definition or renamed known result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information is provided on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5833 in / 1277 out tokens · 53832 ms · 2026-06-27T01:10:02.028180+00:00 · methodology

Reference graph

Works this paper leans on

100 extracted references · 25 linked inside Pith

[1]

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,

Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024

Pith/arXiv arXiv 2024
[2]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[3]

Minicpm-v: A gpt-4v level mllm on your phone,

Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He et al., “Minicpm-v: A gpt-4v level mllm on your phone,” arXiv preprint arXiv:2408.01800, 2024

Pith/arXiv arXiv 2024
[4]

Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output,

P. Zhang, X. Dong, Y. Zang, Y. Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyang et al., “Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output,” arXiv preprint arXiv:2407.03320, 2024

Pith/arXiv arXiv 2024
[5]

Chatglm: A family of large language models from glm-130b to glm-4 all tools,

T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao et al., “Chatglm: A family of large language models from glm-130b to glm-4 all tools,” arXiv preprint arXiv:2406.12793, 2024

Pith/arXiv arXiv 2024
[6]

Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,

K. Ataallah, X. Shen, E. Abdelrahman, E. Sleiman, D. Zhu, J. Ding, and M. Elhoseiny, “Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,” arXiv preprint arXiv:2404.03413, 2024

arXiv 2024
[7]

Video-chatgpt: Towards detailed video understanding via large vision and language models,

M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” arXiv preprint arXiv:2306.05424, 2023

Pith/arXiv arXiv 2023
[8]

Videochat: Chat-centric video understanding,

K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

Pith/arXiv arXiv 2023
[9]

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,

A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 714–10 726

2023
[10]

Internvideo: General video foundation models via generative and discriminative learning,

Y. Wang, K. Li, Y. Li, Y. He, B. Huang, Z. Zhao, H. Zhang, J. Xu, Y. Liu, Z. Wang et al., “Internvideo: General video foundation models via generative and discriminative learning,” arXiv preprint arXiv:2212.03191, 2022

Pith/arXiv arXiv 2022
[11]

Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,

Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhao et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024
[12]

Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,

Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,” arXiv preprint arXiv:2409.12961, 2024

arXiv 2024
[13]

Timechat: A time-sensitive multimodal large language model for long video understanding,

S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 313–14 323

2024
[14]

Flash-vstream: Memory-based real-time understanding for long video streams,

H. Zhang, Y. Wang, Y. Tang, Y. Liu, J. Feng, J. Dai, and X. Jin, “Flash-vstream: Memory-based real-time understanding for long video streams,” arXiv preprint arXiv:2406.08085, 2024

arXiv 2024
[15]

Moviechat+: Question-aware sparse memory for long video question answering,

E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” arXiv preprint arXiv:2404.17176, 2024

arXiv 2024
[16]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

B. He, H. Li, Y. K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 504–13 514

2024
[17]

Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,

X. Wang, D. Song, S. Chen, C. Zhang, and B. Wang, “Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,” [Online]. Available: https://arxiv.org/abs/2409.02889

arXiv
[19]

Longvila: Scaling long-context visual language models for long videos,

F. Xue, Y. Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y. Fang, H. Tang, S. Yang, Z. Liu, Y. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y. Zhu, Y. Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,” 2024

2024
[20]

Long context transfer from language to vision,

P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,” arXiv preprint arXiv:2406.16852, 2024

Pith/arXiv arXiv 2024
[21]

Videollm-online: Online video large language model for streaming video,

J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou, “Videollm-online: Online video large language model for streaming video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 407–18 418

2024
[22]

Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation,

S. Wu, J. Chen, K. Q. Lin, Q. Wang, Y. Gao, Q. Xu, T. Xu, Y. Hu, E. Chen, and M. Z. Shou, “Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation,” Advances in Neural Information Processing Systems, vol. 37, pp. 109 922–109 947, 2024

2024
[23]

Lion-fs: Fast & slow video-language thinker as online video assistant,

W. Li, B. Hu, R. Shao, L. Shen, and L. Nie, “Lion-fs: Fast & slow video-language thinker as online video assistant,” arXiv preprint arXiv:2503.03663, 2025

arXiv 2025
[24]

Streammind: Unlocking full frame rate streaming video dialogue through event-gated cognition,

X. Ding, H. Wu, Y. Yang, S. Jiang, D. Bai, Z. Chen, and T. Cao, “Streammind: Unlocking full frame rate streaming video dialogue through event-gated cognition,” arXiv preprint arXiv:2503.06220, 2025

arXiv 2025
[25]

Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format,

Y. Wang, X. Meng, Y. Wang, J. Liang, J. Wei, H. Zhang, and D. Zhao, “Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format,” arXiv preprint arXiv:2411.17991, 2024

arXiv 2024
[26]

Streaming long video understanding with large language models,

R. Qian, X. Dong, P. Zhang, Y. Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 119 336–119 360, 2024

2024
[27]

Streaming dense video captioning,

X. Zhou, A. Arnab, S. Buch, S. Yan, A. Myers, X. Xiong, A. Nagrani, and C. Schmid, “Streaming dense video captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 243–18 252

2024
[28]

Streaming video understanding and multi-round interaction with memory-enhanced knowledge,

H. Xiong, Z. Yang, J. Yu, Y. Zhuge, L. Zhang, J. Zhu, and H. Lu, “Streaming video understanding and multi-round interaction with memory-enhanced knowledge,” arXiv preprint arXiv:2501.13468, 2025

arXiv 2025
[29]

Ego4d: Around the world in 3,000 hours of egocentric video,

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., “Ego4d: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 995–19 012

2022
[30]

Soccernet: A scalable dataset for action spotting in soccer videos,

S. Giancola, M. Amine, T. Dghaily, and B. Ghanem, “Soccernet: A scalable dataset for action spotting in soccer videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 1711–1721

2018
[31]

Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding,

Z. Yang, Y. Hu, Z. Du, D. Xue, S. Qian, J. Wu, F. Yang, W. Dong, and C. Xu, “Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding,” arXiv preprint arXiv:2502.10810, 2025

arXiv 2025
[32]

Ovo-bench: How far is your video-llms from real-world online video understanding?

Y. Li, J. Niu, Z. Miao, C. Ge, Y. Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian et al., “Ovo-bench: How far is your video-llms from real-world online video understanding?” arXiv preprint arXiv:2501.05510, 2025

arXiv 2025
[33]

Livestar: Live streaming assistant for real-world online video understanding,

Z. Yang, K. Zhang, Y. Hu, B. Wang, S. Qian, B. Wen, F. Yang, T. Gao, W. Dong, and C. Xu, “Livestar: Live streaming assistant for real-world online video understanding,” Advances in Neural Information Processing Systems, vol. 38, pp. 31 266–31 304, 2026

2026
[34]

Llama 2: Open foundation and fine-tuned chat models,

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

Pith/arXiv arXiv 2023
[35]

Gemini: a family of highly capable multimodal models,

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023
[36]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[37]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

2022
[38]

Improving language understanding by generative pre-training,

A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018

2018
[39]

Video dataflywheel: Resolving the impossible data trinity in video-language understanding,

X. Wang, J. Wu, Z. Lin, F. Zhang, D. Zhang, and L. Nie, “Video dataflywheel: Resolving the impossible data trinity in video-language understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[40]

Object-centric representation learning for video scene understanding,

Y. Zhou, H. Zhang, S.-I. Park, B. Yoo, and X. Qi, “Object-centric representation learning for video scene understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 8410–8423, 2024

2024
[41]

Sharegpt4video: Improving video understanding and generation with better captions,

L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, Z. Tang, L. Yuan et al., “Sharegpt4video: Improving video understanding and generation with better captions,” Advances in Neural Information Processing Systems, vol. 37, pp. 19 472–19 495, 2024

2024
[42]

Pllava: Parameter-free llava extension from images to videos for video dense captioning,

L. Xu, Y. Zhao, D. Zhou, Z. Lin, S. K. Ng, and J. Feng, “Pllava: Parameter-free llava extension from images to videos for video dense captioning,” arXiv preprint arXiv:2404.16994, 2024

Pith/arXiv arXiv 2024
[43]

Video recap: Recursive captioning of hour-long videos,

M. M. Islam, N. Ho, X. Yang, T. Nagarajan, L. Torresani, and G. Bertasius, “Video recap: Recursive captioning of hour-long videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 198–18 208

2024
[44]

Large language models are temporal and causal reasoners for video question answering,

D. Ko, J. S. Lee, W. Kang, B. Roh, and H. J. Kim, “Large language models are temporal and causal reasoners for video question answering,” arXiv preprint arXiv:2310.15747, 2023

arXiv 2023
[45]

Mvbench: A comprehensive multi-modal video understanding benchmark,

K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo et al., “Mvbench: A comprehensive multi-modal video understanding benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 195–22 206

2024
[46]

Videogpt+: Integrating image and video encoders for enhanced video understanding,

M. Maaz, H. Rasheed, S. Khan, and F. Khan, “Videogpt+: Integrating image and video encoders for enhanced video understanding,” arXiv preprint arXiv:2406.09418, 2024

arXiv 2024
[47]

Learning to answer visual questions from web videos,

A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Learning to answer visual questions from web videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3202–3218, 2025

2025
[48]

Transformer-empowered invariant grounding for video question answering,

Y. Li, X. Wang, J. Xiao, W. Ji, and T.-S. Chua, “Transformer-empowered invariant grounding for video question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 9510–9522, 2025

2025
[49]

Intentqa: Intent question answering in videos by cognitive context reasoning,

J. Li, P. Wei, W. Han, S.-C. Zhu, and L. Fan, “Intentqa: Intent question answering in videos by cognitive context reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–18, 2026

2026
[50]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,

Y. Guo, J. Liu, M. Li, D. Cheng, X. Tang, D. Sui, Q. Liu, X. Chen, and K. Zhao, “Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3302–3310

2025
[51] Y. Xu, Y. Sun, Z. Xie, B. Zhai, and S. Du, “Vtg-gpt: Tuning-free zero-shot video temporal grounding with gpt,” Applied Sciences, vol. 14, no. 5, p. 1894, 2024.

[52] Y. Wang, X. Meng, J. Liang, Y. Wang, Q. Liu, and D. Zhao, “Hawkeye: Training video-text llms for grounding text in videos,” arXiv preprint arXiv:2403.10228, 2024.

[53] Y. Zhang, B. Li, H. Liu, Y. J. Lee, L. Gui, D. Fu, J. Feng, Z. Liu, and C. Li, “Llava-next: A strong zero-shot video understanding model,” April 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-04-30-llava-next-video/

[54] B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,” arXiv preprint arXiv:2311.10122, 2023.

[55] J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 689–26 699.

[56] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.

[57] J. Liu, S. Chen, X. He, L. Guo, X. Zhu, W. Wang, and J. Tang, “Valor: Vision-audio-language omni-perception pretraining model and dataset,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 2, pp. 708–724, 2025.

[58] W. Wu, X. Wang, H. Luo, J. Wang, Y. Yang, and W. Ouyang, “Cap4video++: Enhancing video understanding with auxiliary captions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5223–5237, 2025.

[59] P. Jin, H. Li, L. Yuan, S. Yan, and J. Chen, “Hierarchical banzhaf interaction for general video-language representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2125–2139, 2025.

[60] X. Wang, J. Wu, Z. Lin, F. Zhang, D. Zhang, and L. Nie, “Video dataflywheel: Resolving the impossible data trinity in video-language understanding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2912–2923, 2025.

[61] J. Gao, Y. Lian, Z. Zhou, Y. Fu, and B. Wang, “Livechat: A large-scale personalized dialogue dataset automatically constructed from live streaming,” arXiv preprint arXiv:2306.08401, 2023.

[62] Z. Yang, K. Zhang, S. Qian, W. Dong, and C. Xu, “Don’t pause: Streaming video-language synchrony for online video understanding,” arXiv preprint arXiv:2606.06991, 2026.

[63] K. Zhang, Z. Yang, B. Wang, S. Qian, and C. Xu, “Querystream: Advancing streaming video understanding with query-aware pruning and proactive response,” in The Fourteenth International Conference on Learning Representations, 2026.

[64] B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, H. Wang, Y. Pang, W. Jiang, J. Zhang, Z. Li et al., “Languagebind: Extending video-language pre-training to n-modality by language-based semantic alignment,” arXiv preprint arXiv:2310.01852, 2023.

[65] K. Mangalam, R. Akshulakov, J. Malik et al., “Egoschema: A diagnostic benchmark for very long-form video language understanding,” Advances in Neural Information Processing Systems, vol. 36, 2024.

[66] Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao, “Activitynet-qa: A dataset for understanding complex web videos via question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9127–9134.

[67] L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, “Hero: Hierarchical encoder for video+language omni-representation pre-training,” arXiv preprint arXiv:2005.00200, 2020.

[68] V. Patraucean, L. Smaira, A. Gupta, A. Recasens, L. Markeeva, D. Banarse, S. Koppula, M. Malinowski, Y. Yang, C. Doersch et al., “Perception test: A diagnostic benchmark for multimodal video models,” Advances in Neural Information Processing Systems, vol. 36, 2024.

[69] A. Zadeh, M. Chan, P. P. Liang, E. Tong, and L.-P. Morency, “Social-iq: A question answering benchmark for artificial social intelligence,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8807–8817.

[70] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1645–1653.

[71] J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, compositional video question answering,” arXiv preprint arXiv:1809.01696, 2018.

[72] J. Xiao, X. Shang, A. Yao, and T.-S. Chua, “Next-qa: Next phase of question-answering to explaining temporal actions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9777–9786.

[73] E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y. Zhang et al., “Moviechat: From dense token to sparse memory for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 221–18 232.

[74] W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, S. Huang, B. Xu, Y. Dong, M. Ding et al., “Lvbench: An extreme long video understanding benchmark,” arXiv preprint arXiv:2406.08035, 2024.

[75] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2758–2766.

[76] E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 374–389, 2026.

[77] J. Li, M. Gao, X. He, S. Tang, W.-S. Zheng, J. Xiao, M. Wang, T.-S. Chua, and Y. Zhuang, “Momentor++: Advancing video large language models with fine-grained long video reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6208–6224, 2026.

[78] K. Zhang, Z. Yang, M. Han, Y. Zhuge, H. Hao, C. Li, Z. Li, and X. Chang, “Selongvlm: Empowering long video language models with self-corrective clip selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, 2026.

[79] S. Tian, R. Wang, H. Guo, P. Wu, Y. Dong, X. Wang, J. Yang, H. Zhang, H. Zhu, and Z. Liu, “Ego-r1: Agentic chain-of-tool-thought for ultra-long egocentric video reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, 2026.

[80] S. A. Peirone, F. Pistilli, A. Alliegro, T. Tommasi, and G. Averta, “Hier-egopack: Hierarchical egocentric video understanding with diverse task perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1917–1931, 2026.

Showing first 80 references.

[1] [1]

Expanding performance boundaries of open- source multimodal models with model, data, and test-time scaling,

Z. Chen, W. Wang, Y . Cao, Y . Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liuet al., “Expanding performance boundaries of open- source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024

Pith/arXiv arXiv 2024

[2] [2]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y . Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024

[3] [3]

Minicpm-v: A gpt-4v level mllm on your phone,

Y . Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. Heet al., “Minicpm-v: A gpt-4v level mllm on your phone,”arXiv preprint arXiv:2408.01800, 2024

Pith/arXiv arXiv 2024

[4] [4]

Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output,

P. Zhang, X. Dong, Y . Zang, Y . Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyanget al., “Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output,” arXiv preprint arXiv:2407.03320, 2024

Pith/arXiv arXiv 2024

[5] [5]

Chatglm: A family of large language models from glm-130b to glm-4 all tools,

T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhaoet al., “Chatglm: A family of large language models from glm-130b to glm-4 all tools,”arXiv preprint arXiv:2406.12793, 2024

Pith/arXiv arXiv 2024

[6] [6]

Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,

K. Ataallah, X. Shen, E. Abdelrahman, E. Sleiman, D. Zhu, J. Ding, and M. Elhoseiny, “Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens,”arXiv preprint arXiv:2404.03413, 2024

arXiv 2024

[7] [7]

Video-chatgpt: Towards detailed video understanding via large vision and language models,

M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” arXiv preprint arXiv:2306.05424, 2023

Pith/arXiv arXiv 2023

[8] [8]

Videochat: Chat-centric video understanding,

K. Li, Y . He, Y . Wang, Y . Li, W. Wang, P. Luo, Y . Wang, L. Wang, and Y . Qiao, “Videochat: Chat-centric video understanding,”arXiv preprint arXiv:2305.06355, 2023

Pith/arXiv arXiv 2023

[9] [9]

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,

A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid, “Vid2seq: Large-scale pretraining of a visual language model for dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 714–10 726

2023

[10] [10]

Internvideo: General video foundation models via gen- erative and discriminative learning,

Y . Wang, K. Li, Y . Li, Y . He, B. Huang, Z. Zhao, H. Zhang, J. Xu, Y . Liu, Z. Wanget al., “Internvideo: General video foundation models via gen- erative and discriminative learning,”arXiv preprint arXiv:2212.03191, 2022

Pith/arXiv arXiv 2022

[11] [11]

Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms,

Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhaoet al., “Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms,”arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024

[12] [12]

Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution,

Z. Liu, Y . Dong, Z. Liu, W. Hu, J. Lu, and Y . Rao, “Oryx mllm: On- demand spatial-temporal understanding at arbitrary resolution,”arXiv preprint arXiv:2409.12961, 2024

arXiv 2024

[13] [13]

Timechat: A time-sensitive multimodal large language model for long video understanding,

S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 313–14 323

2024

[14] [14]

Flash-vstream: Memory-based real-time understanding for long video streams,

H. Zhang, Y . Wang, Y . Tang, Y . Liu, J. Feng, J. Dai, and X. Jin, “Flash-vstream: Memory-based real-time understanding for long video streams,”arXiv preprint arXiv:2406.08085, 2024

arXiv 2024

[15] [15]

Moviechat+: Question-aware sparse memory for long video question answering,

E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” arXiv preprint arXiv:2404.17176, 2024

arXiv 2024

[16] [16]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,

B. He, H. Li, Y . K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma-lmm: Memory-augmented large multimodal model for long-term video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 504–13 514

2024

[17] [17]

Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,

X. Wang, D. Song, S. Chen, C. Zhang, and B. Wang, “Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture,”

[18] [18]

Available: https://arxiv.org/abs/2409.02889

[Online]. Available: https://arxiv.org/abs/2409.02889

arXiv

[19] [19]

Longvila: Scaling long-context visual language models for long videos,

F. Xue, Y . Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y . Fang, H. Tang, S. Yang, Z. Liu, Y . He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y . Zhu, Y . Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,”null, 2024

2024

[20] [20]

Long context transfer from language to vision,

P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y . Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,”arXiv preprint arXiv:2406.16852, 2024

Pith/arXiv arXiv 2024

[21] [21]

Videollm-online: Online video large language model for streaming video,

J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou, “Videollm-online: Online video large language model for streaming video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 407–18 418. PREPRINT, 2026 17

2024

[22] [22]

Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation,

S. Wu, J. Chen, K. Q. Lin, Q. Wang, Y . Gao, Q. Xu, T. Xu, Y . Hu, E. Chen, and M. Z. Shou, “Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation,”Advances in Neural Information Processing Systems, vol. 37, pp. 109 922–109 947, 2024

2024

[23] [23]

Lion-fs: Fast & slow video-language thinker as online video assistant,

W. Li, B. Hu, R. Shao, L. Shen, and L. Nie, “Lion-fs: Fast & slow video-language thinker as online video assistant,”arXiv preprint arXiv:2503.03663, 2025

arXiv 2025

[24] [24]

Stream- mind: Unlocking full frame rate streaming video dialogue through event- gated cognition,

X. Ding, H. Wu, Y . Yang, S. Jiang, D. Bai, Z. Chen, and T. Cao, “Stream- mind: Unlocking full frame rate streaming video dialogue through event- gated cognition,”arXiv preprint arXiv:2503.06220, 2025

arXiv 2025

[25] [25]

Videollm knows when to speak: Enhancing time-sensitive video com- prehension with video-text duet interaction format,

Y . Wang, X. Meng, Y . Wang, J. Liang, J. Wei, H. Zhang, and D. Zhao, “Videollm knows when to speak: Enhancing time-sensitive video com- prehension with video-text duet interaction format,”arXiv preprint arXiv:2411.17991, 2024

arXiv 2024

[26] [26]

Streaming long video understanding with large language models,

R. Qian, X. Dong, P. Zhang, Y . Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 119 336– 119 360, 2024

2024

[27] [27]

Streaming dense video captioning,

X. Zhou, A. Arnab, S. Buch, S. Yan, A. Myers, X. Xiong, A. Nagrani, and C. Schmid, “Streaming dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 243–18 252

2024

[28] [28]

Streaming video understanding and multi-round interaction with memory-enhanced knowledge,

H. Xiong, Z. Yang, J. Yu, Y . Zhuge, L. Zhang, J. Zhu, and H. Lu, “Streaming video understanding and multi-round interaction with memory-enhanced knowledge,”arXiv preprint arXiv:2501.13468, 2025

arXiv 2025

[29] [29]

Ego4d: Around the world in 3,000 hours of egocentric video,

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liuet al., “Ego4d: Around the world in 3,000 hours of egocentric video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 995–19 012

2022

[30] [30]

Soccernet: A scalable dataset for action spotting in soccer videos,

S. Giancola, M. Amine, T. Dghaily, and B. Ghanem, “Soccernet: A scalable dataset for action spotting in soccer videos,” inProceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 1711–1721

2018

[31] [31]

Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding,

Z. Yang, Y . Hu, Z. Du, D. Xue, S. Qian, J. Wu, F. Yang, W. Dong, and C. Xu, “Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding,”arXiv preprint arXiv:2502.10810, 2025

arXiv 2025

[32] [32]

Ovo-bench: How far is your video- llms from real-world online video understanding?

Y . Li, J. Niu, Z. Miao, C. Ge, Y . Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qianet al., “Ovo-bench: How far is your video- llms from real-world online video understanding?”arXiv preprint arXiv:2501.05510, 2025

arXiv 2025

[33] [33]

Livestar: Live streaming assistant for real-world online video understanding,

Z. Yang, K. Zhang, Y . Hu, B. Wang, S. Qian, B. Wen, F. Yang, T. Gao, W. Dong, and C. Xu, “Livestar: Live streaming assistant for real-world online video understanding,”Advances in Neural Information Processing Systems, vol. 38, pp. 31 266–31 304, 2026

2026

[34] [34]

Llama 2: Open foundation and fine-tuned chat models,

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

Pith/arXiv arXiv 2023

[35] [35]

Gemini: a family of highly capable multimodal models,

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millicanet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023

[36] [36]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[37] [37]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

2022

[38] [38]

Improving language understanding by generative pre-training,

A. Radford, K. Narasimhan, T. Salimans, I. Sutskeveret al., “Improving language understanding by generative pre-training,”Unknown, 2018

2018

[39] [39]

Video dataflywheel: Resolving the impossible data trinity in video-language understanding,

X. Wang, J. Wu, Z. Lin, F. Zhang, D. Zhang, and L. Nie, “Video dataflywheel: Resolving the impossible data trinity in video-language understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[40] [40]

Object-centric rep- resentation learning for video scene understanding,

Y . Zhou, H. Zhang, S.-I. Park, B. Yoo, and X. Qi, “Object-centric rep- resentation learning for video scene understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 8410– 8423, 2024

2024

[41] [41]

Sharegpt4video: Improving video understanding and generation with better captions,

L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y . Zang, Z. Chen, H. Duan, Z. Tang, L. Yuanet al., “Sharegpt4video: Improving video understanding and generation with better captions,”Advances in Neural Information Processing Systems, vol. 37, pp. 19 472–19 495, 2024

2024

[42] [42]

Pllava: Parameter-free llava extension from images to videos for video dense captioning,

L. Xu, Y . Zhao, D. Zhou, Z. Lin, S. K. Ng, and J. Feng, “Pllava: Parameter-free llava extension from images to videos for video dense captioning,”arXiv preprint arXiv:2404.16994, 2024

Pith/arXiv arXiv 2024

[43] [43]

Video recap: Recursive captioning of hour-long videos,

M. M. Islam, N. Ho, X. Yang, T. Nagarajan, L. Torresani, and G. Bertasius, “Video recap: Recursive captioning of hour-long videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 198–18 208

2024

[44] [44]

Large language models are temporal and causal reasoners for video question answering,

D. Ko, J. S. Lee, W. Kang, B. Roh, and H. J. Kim, “Large language models are temporal and causal reasoners for video question answering,” arXiv preprint arXiv:2310.15747, 2023

arXiv 2023

[45] [45]

Mvbench: A comprehensive multi-modal video understanding benchmark,

K. Li, Y . Wang, Y . He, Y . Li, Y . Wang, Y . Liu, Z. Wang, J. Xu, G. Chen, P. Luoet al., “Mvbench: A comprehensive multi-modal video understanding benchmark,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 195–22 206

2024

[46] [46]

Videogpt+: Integrating image and video encoders for enhanced video understanding,

M. Maaz, H. Rasheed, S. Khan, and F. Khan, “Videogpt+: Integrating image and video encoders for enhanced video understanding,”arXiv preprint arXiv:2406.09418, 2024

arXiv 2024

[47] [47]

Learning to answer visual questions from web videos,

A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Learning to answer visual questions from web videos,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3202–3218, 2025

2025

[48] [48]

Transformer-empowered invariant grounding for video question answering,

Y . Li, X. Wang, J. Xiao, W. Ji, and T.-S. Chua, “Transformer-empowered invariant grounding for video question answering,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 9510– 9522, 2025

2025

[49] [49]

Intentqa: Intent question answering in videos by cognitive context reasoning,

J. Li, P. Wei, W. Han, S.-C. Zhu, and L. Fan, “Intentqa: Intent question answering in videos by cognitive context reasoning,”IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–18, 2026

2026

[50] [50]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,

Y . Guo, J. Liu, M. Li, D. Cheng, X. Tang, D. Sui, Q. Liu, X. Chen, and K. Zhao, “Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3302– 3310

2025

[51] [51]

Vtg-gpt: Tuning-free zero- shot video temporal grounding with gpt,

Y . Xu, Y . Sun, Z. Xie, B. Zhai, and S. Du, “Vtg-gpt: Tuning-free zero- shot video temporal grounding with gpt,”Applied Sciences, vol. 14, no. 5, p. 1894, 2024

2024

[52] [52]

Hawkeye: Training video-text llms for grounding text in videos,

Y . Wang, X. Meng, J. Liang, Y . Wang, Q. Liu, and D. Zhao, “Hawkeye: Training video-text llms for grounding text in videos,”arXiv preprint arXiv:2403.10228, 2024

arXiv 2024

[53] [53]

Llava-next: A strong zero-shot video understanding model,

Y . Zhang, B. Li, h. Liu, Y . j. Lee, L. Gui, D. Fu, J. Feng, Z. Liu, and C. Li, “Llava-next: A strong zero-shot video understanding model,” April 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-04- 30-llava-next-video/

2024

[54] [54]

Video-llava: Learning united visual representation by alignment before projection,

B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,” arXiv preprint arXiv:2311.10122, 2023

Pith/arXiv arXiv 2023

[55] [55]

Vila: On pre-training for visual language models,

J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 689–26 699

2024

[56] [56]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,

M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieseret al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

Pith/arXiv arXiv 2024

[57] [57]

Valor: Vision-audio-language omni-perception pretraining model and dataset,

J. Liu, S. Chen, X. He, L. Guo, X. Zhu, W. Wang, and J. Tang, “Valor: Vision-audio-language omni-perception pretraining model and dataset,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 47, no. 2, pp. 708–724, 2025

2025

[58] [58]

Cap4video++: Enhancing video understanding with auxiliary captions,

W. Wu, X. Wang, H. Luo, J. Wang, Y . Yang, and W. Ouyang, “Cap4video++: Enhancing video understanding with auxiliary captions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5223–5237, 2025

2025

[59] [59]

Hierarchical banzhaf interaction for general video-language representation learning,

P. Jin, H. Li, L. Yuan, S. Yan, and J. Chen, “Hierarchical banzhaf interaction for general video-language representation learning,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2125–2139, 2025

2025

[60] [60]

Video dataflywheel: Resolving the impossible data trinity in video-language understanding,

X. Wang, J. Wu, Z. Lin, F. Zhang, D. Zhang, and L. Nie, “Video dataflywheel: Resolving the impossible data trinity in video-language understanding,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2912–2923, 2025

2025

[61] [61]

Livechat: A large- scale personalized dialogue dataset automatically constructed from live streaming,

J. Gao, Y . Lian, Z. Zhou, Y . Fu, and B. Wang, “Livechat: A large- scale personalized dialogue dataset automatically constructed from live streaming,”arXiv preprint arXiv:2306.08401, 2023

arXiv 2023

[62] [62]

Don’t pause: Streaming video-language synchrony for online video understanding,

Z. Yang, K. Zhang, S. Qian, W. Dong, and C. Xu, “Don’t pause: Streaming video-language synchrony for online video understanding,” arXiv preprint arXiv:2606.06991, 2026

Pith/arXiv arXiv 2026

[63] [63]

Querystream: Advancing streaming video understanding with query-aware pruning PREPRINT, 2026 18 and proactive response,

K. Zhang, Z. Yang, B. Wang, S. Qian, and C. Xu, “Querystream: Advancing streaming video understanding with query-aware pruning PREPRINT, 2026 18 and proactive response,” inThe Fourteenth International Conference on Learning Representations, 2026

2026

[64] [64]

Languagebind: Extending video-language pre- training to n-modality by language-based semantic alignment,

B. Zhu, B. Lin, M. Ning, Y . Yan, J. Cui, H. Wang, Y . Pang, W. Jiang, J. Zhang, Z. Liet al., “Languagebind: Extending video-language pre- training to n-modality by language-based semantic alignment,”arXiv preprint arXiv:2310.01852, 2023

Pith/arXiv arXiv 2023

[65] [65]

Egoschema: A diagnostic benchmark for very long-form video language understanding,

K. Mangalam, R. Akshulakov, J. Maliket al., “Egoschema: A diagnostic benchmark for very long-form video language understanding,”Advances in Neural Information Processing Systems, vol. 36, 2024

2024

[66] [66]

Activitynet- qa: A dataset for understanding complex web videos via question answering,

Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y . Zhuang, and D. Tao, “Activitynet- qa: A dataset for understanding complex web videos via question answering,” inProceedings of the AAAI Conference on Artificial In- telligence, vol. 33, no. 01, 2019, pp. 9127–9134

2019

[67] [67]

Hero: Hier- archical encoder for video+ language omni-representation pre-training,

L. Li, Y .-C. Chen, Y . Cheng, Z. Gan, L. Yu, and J. Liu, “Hero: Hier- archical encoder for video+ language omni-representation pre-training,” arXiv preprint arXiv:2005.00200, 2020

arXiv 2005

[68] [68]

Per- ception test: A diagnostic benchmark for multimodal video models,

V . Patraucean, L. Smaira, A. Gupta, A. Recasens, L. Markeeva, D. Ba- narse, S. Koppula, M. Malinowski, Y . Yang, C. Doerschet al., “Per- ception test: A diagnostic benchmark for multimodal video models,” Advances in Neural Information Processing Systems, vol. 36, 2024

2024

[69] [69]

Social- iq: A question answering benchmark for artificial social intelligence,

A. Zadeh, M. Chan, P. P. Liang, E. Tong, and L.-P. Morency, “Social- iq: A question answering benchmark for artificial social intelligence,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8807–8817

2019

[70] [70]

Video question answering via gradually refined attention over appearance and motion,

D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y . Zhuang, “Video question answering via gradually refined attention over appearance and motion,” inProceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1645–1653

2017

[71] J. Lei, L. Yu, M. Bansal, and T. L. Berg, “Tvqa: Localized, compositional video question answering,” arXiv preprint arXiv:1809.01696, 2018

[72] J. Xiao, X. Shang, A. Yao, and T.-S. Chua, “Next-qa: Next phase of question-answering to explaining temporal actions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9777–9786

[73] E. Song, W. Chai, G. Wang, Y. Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y. Zhang et al., “Moviechat: From dense token to sparse memory for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 221–18 232

[74] W. Wang, Z. He, W. Hong, Y. Cheng, X. Zhang, J. Qi, S. Huang, B. Xu, Y. Dong, M. Ding et al., “Lvbench: An extreme long video understanding benchmark,” arXiv preprint arXiv:2406.08035, 2024

[75] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “Tgif-qa: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2758–2766

[76] E. Song, W. Chai, T. Ye, J.-N. Hwang, X. Li, and G. Wang, “Moviechat+: Question-aware sparse memory for long video question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 374–389, 2026

[77] J. Li, M. Gao, X. He, S. Tang, W.-S. Zheng, J. Xiao, M. Wang, T.-S. Chua, and Y. Zhuang, “Momentor++: Advancing video large language models with fine-grained long video reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6208–6224, 2026

[78] K. Zhang, Z. Yang, M. Han, Y. Zhuge, H. Hao, C. Li, Z. Li, and X. Chang, “Selongvlm: Empowering long video language models with self-corrective clip selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, 2026

[79] S. Tian, R. Wang, H. Guo, P. Wu, Y. Dong, X. Wang, J. Yang, H. Zhang, H. Zhu, and Z. Liu, “Ego-r1: Agentic chain-of-tool-thought for ultra-long egocentric video reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, 2026

[80] S. A. Peirone, F. Pistilli, A. Alliegro, T. Tommasi, and G. Averta, “Hier-egopack: Hierarchical egocentric video understanding with diverse task perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 2, pp. 1917–1931, 2026