
arxiv: 2606.06338 · v1 · pith:VRCPXJCKnew · submitted 2026-06-04 · 💻 cs.CV

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Pith reviewed 2026-06-28 02:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords Video Question Answering · Deep Video Understanding · Storyline Reasoning · Large-scale Dataset · Hierarchical Plot Structure · Multi-agent Generation · TV Series and Movies

The pith

VideoQA models cannot maintain long-range character associations or coherent storyline understanding on a new 363K-question benchmark of TV series and movies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs StoryVideoQA, the largest deep video understanding dataset with over 363,000 questions on 393 hours of story videos including longer movies, using an improved multi-agent generation process called StoryMindv2. Evaluations of 20 existing VideoQA methods on this benchmark show they fail to track characters across extended sequences or build coherent pictures of complex plots. The authors introduce PlotTree, an agent that converts long video content into a hierarchical plot structure to support more effective storyline reasoning.

Core claim

Existing VideoQA approaches excel on factoid questions but cannot sustain long-range character associations or construct coherent understanding of complex storylines when tested at scale; StoryVideoQA supplies the first benchmark large and diverse enough to expose this gap across both TV series and full-length movies, while PlotTree demonstrates that reorganizing video into hierarchical plots enables efficient reasoning over those storylines.

What carries the argument

PlotTree, a video understanding agent that re-organizes long-range video content into a hierarchical plot structure to support storyline reasoning.
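To make the idea of a hierarchical plot structure concrete, here is a minimal sketch. The abstract does not describe PlotTree's internals, so everything in it is an assumption made for illustration: the `PlotNode` class, the story/act/scene levels, the time spans, and the `path_to` lookup are hypothetical and are not the authors' design. The only number taken from the source is the 7,878-second average movie length.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlotNode:
    """One node in a hypothetical plot hierarchy (e.g. story -> act -> scene)."""
    summary: str
    start_s: float  # inclusive start time, in seconds
    end_s: float    # exclusive end time, in seconds
    children: List["PlotNode"] = field(default_factory=list)


def path_to(node: PlotNode, t: float) -> List[str]:
    """Return summaries from the root down to the finest node covering time t.

    An empty list means t falls outside this node's time span.
    """
    if not (node.start_s <= t < node.end_s):
        return []
    for child in node.children:
        sub_path = path_to(child, t)
        if sub_path:
            return [node.summary] + sub_path
    return [node.summary]


# Toy tree for one movie of average length (7,878 s); the split points are invented.
root = PlotNode("Full movie", 0, 7878, [
    PlotNode("Act 1", 0, 2600, [
        PlotNode("Scene A", 0, 900),
        PlotNode("Scene B", 900, 2600),
    ]),
    PlotNode("Act 2", 2600, 7878),
])

print(path_to(root, 1000))  # ['Full movie', 'Act 1', 'Scene B']
print(path_to(root, 5000))  # ['Full movie', 'Act 2']
```

The coarse-to-fine path is one way such a structure could give a question about a single moment some wider storyline context without scanning the whole video. Whether PlotTree works this way is not stated in the abstract.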

If this is right

  • Current state-of-the-art VideoQA methods cannot fully maintain long-range character associations across extended videos.
  • They also cannot construct coherent understanding of complex storylines on the scale of full movies.
  • Re-organizing video content into a hierarchical plot structure enables more efficient storyline reasoning.
  • The scale and diversity of StoryVideoQA expose limitations not visible in smaller, manually constructed DVU datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • PlotTree's hierarchical approach may generalize to other long-form sequential reasoning tasks such as multi-document summarization or long-horizon planning.
  • If the generation pipeline scales without degradation, similar automatic construction methods could be applied to create benchmarks in adjacent domains like audio stories or game logs.
  • The observed failures suggest that future VideoQA architectures may need explicit memory or graph-based structures for character tracking rather than relying solely on transformer attention.

Load-bearing premise

The supervisor-guided multi-agent generation and multi-reviewer voting process produces high-quality, balanced question-answer pairs that accurately reflect complex storylines without systematic bias or hallucination.
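The premise depends on how reviewer votes become an accept or reject decision. The abstract does not give the voting rule, so the sketch below assumes a plain majority vote. The names `majority_accept` and `filter_qas` and the 0.5 threshold are hypothetical, chosen only to show where a systematic bias in the reviewers would pass straight through the filter.

```python
from typing import List, Tuple


def majority_accept(votes: List[bool], threshold: float = 0.5) -> bool:
    """Accept a QA pair if the share of approving reviewers exceeds the threshold.

    Assumption: a simple majority rule; the paper's actual rule is not specified.
    """
    if not votes:
        return False  # no reviews means no evidence of quality
    return sum(votes) / len(votes) > threshold


def filter_qas(candidates: List[Tuple[str, List[bool]]],
               threshold: float = 0.5) -> List[str]:
    """Keep only the QA identifiers whose reviewer votes pass the majority rule."""
    return [qa for qa, votes in candidates if majority_accept(votes, threshold)]


candidates = [
    ("Q1", [True, True, False]),   # 2 of 3 approve -> kept
    ("Q2", [True, False, False]),  # 1 of 3 approve -> dropped
    ("Q3", []),                    # unreviewed -> dropped
]
print(filter_qas(candidates))  # ['Q1']
```

If all reviewers share the same blind spot, a hallucinated answer can still collect a majority, which is why agreement among automated reviewers cannot substitute for an external check.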

What would settle it

A controlled human audit of several thousand generated question-answer pairs for factual accuracy, storyline fidelity, and absence of bias, followed by re-running the 20 baseline models on the audited subset.
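One way to draw such an audit set is a stratified sample, so that movies and TV series are both represented regardless of their share of the full dataset. The sketch below is illustrative only: the `stratified_sample` helper, the `kind` field, and the per-stratum size are assumptions, and the toy records stand in for real QA pairs.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, List


def stratified_sample(items: List[Dict], key: Callable[[Dict], Hashable],
                      per_stratum: int, seed: int = 0) -> List[Dict]:
    """Draw up to `per_stratum` items from each stratum, without replacement."""
    rng = random.Random(seed)  # fixed seed so the audit set is reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for name in sorted(groups):  # sorted for a deterministic stratum order
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Toy QA records: 10 labeled "movie", 20 labeled "tv".
qas = [{"id": i, "kind": "movie" if i % 3 == 0 else "tv"} for i in range(30)]
audit_set = stratified_sample(qas, key=lambda qa: qa["kind"], per_stratum=5)
print(Counter(qa["kind"] for qa in audit_set))
```

Sampling equal numbers per stratum keeps the movie subset, where degradation was reported for the earlier pipeline, from being swamped by the more numerous TV questions.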

Original abstract

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. These difficulties constrain the scale and diversity of manually-constructed DVU dataset. To address these, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re-organizing long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce StoryVideoQA, the largest DVU dataset to date with 363K QAs over 393.2 hours of TV series (avg. 1635s) and movies (avg. 7878s), generated via StoryMindv2 (an enhanced multi-agent framework with supervisor-guided generation and multi-reviewer voting). It reports that evaluations of 20 SOTA VideoQA models on this benchmark show failures to maintain long-range character associations or coherent storyline understanding, and proposes PlotTree, a hierarchical plot-structure agent, to enable better reasoning.

Significance. If the dataset quality holds and the evaluations are robust, the work would deliver a valuable large-scale, multi-genre benchmark exposing clear limitations in current VideoQA methods for complex narratives, while the PlotTree agent offers a concrete architectural direction for long-range reasoning; the scale and genre diversity would be a notable contribution to the field.

major comments (2)
  1. [Abstract and StoryMindv2 construction section] Abstract and the section on StoryMindv2 / StoryVideoQA construction: the headline claim that 20 models 'cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines' rests entirely on the fidelity of the 363K auto-generated QAs, yet no human validation, inter-annotator agreement, or accuracy comparison against manually curated subsets is reported (especially for the 7878-second movies where v1 degradation was acknowledged). This is load-bearing for both the evaluation results and the motivation for PlotTree.
  2. [StoryMindv2 / multi-reviewer voting description] The section describing the multi-reviewer voting strategy: the assertion that the refined voting produces 'high-quality' and 'balanced' QAs for movies is presented without any quantitative metrics (e.g., agreement rates, hallucination rates, or external human ratings), leaving the central evaluation results unanchored.
minor comments (2)
  1. [Abstract] Abstract contains a repeated sentence ('These difficulties constrain the scale and diversity of manually constructed DVU datasets.') and a minor grammatical issue ('manually-constructed DVU dataset' should be plural).
  2. [Introduction / Related Work] The relationship to the prior StoryMind paper is referenced but the specific quantitative improvements of v2 over v1 on movie-length content are not tabulated or highlighted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the validation of StoryVideoQA. We address the two major comments below and agree that additional quantitative evidence is needed to support the dataset quality claims.

Point-by-point responses
  1. Referee: [Abstract and StoryMindv2 construction section] Abstract and the section on StoryMindv2 / StoryVideoQA construction: the headline claim that 20 models 'cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines' rests entirely on the fidelity of the 363K auto-generated QAs, yet no human validation, inter-annotator agreement, or accuracy comparison against manually curated subsets is reported (especially for the 7878-second movies where v1 degradation was acknowledged). This is load-bearing for both the evaluation results and the motivation for PlotTree.

    Authors: We agree that the fidelity of the auto-generated QAs is central to the headline claims and the motivation for PlotTree. StoryMindv2 extends the prior StoryMind framework (which included some validation for TV series) with supervisor-guided generation and multi-reviewer voting specifically to address degradation on longer movies, but the current manuscript does not report new human validation, inter-annotator agreement, or direct accuracy comparisons against manual subsets. We will add a human evaluation study on sampled QAs (stratified across TV and movies), including accuracy rates, inter-annotator agreement, and comparison to manually curated references. This will be incorporated into the revised manuscript. revision: yes

  2. Referee: [StoryMindv2 / multi-reviewer voting description] The section describing the multi-reviewer voting strategy: the assertion that the refined voting produces 'high-quality' and 'balanced' QAs for movies is presented without any quantitative metrics (e.g., agreement rates, hallucination rates, or external human ratings), leaving the central evaluation results unanchored.

    Authors: We acknowledge that the manuscript asserts high-quality and balanced QAs from the refined multi-reviewer voting without accompanying quantitative metrics such as agreement rates or hallucination rates. The voting mechanism is intended to enforce consensus and filter issues, but no explicit numbers are provided. In the revision we will report reviewer agreement statistics, any hallucination filtering steps, and tie these to the human evaluation results noted above to better anchor the quality claims. revision: yes
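The inter-annotator agreement promised in the responses above is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The rebuttal does not say which statistic will be used, so treat the choice of kappa, the `cohens_kappa` function, and the toy labels below as assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected if both annotators labeled independently.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Two hypothetical annotators judging four generated QA pairs.
annotator_1 = ["ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "bad", "bad", "ok"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Raw agreement alone would overstate how reliable the labels are.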

Circularity Check

0 steps flagged

No circularity: dataset generation, benchmark evaluations, and the PlotTree proposal form an independent chain

Full rationale

The paper's derivation proceeds from describing StoryMindv2 (an enhanced multi-agent framework extending prior StoryMind), applying it to produce the 363K-QA StoryVideoQA dataset across TV series and movies, running direct empirical evaluations of 20 existing VideoQA models on that benchmark to measure failures on long-range associations, and introducing PlotTree as a hierarchical agent to address observed gaps. None of these steps reduce by construction to their inputs via self-definition, fitted parameters renamed as predictions, or load-bearing self-citations that are themselves unverified. The self-reference to StoryMind is limited to motivating the v2 improvements and is not invoked as an external uniqueness theorem or to force the evaluation outcomes. The benchmark results and PlotTree design remain falsifiable against the generated data and external models without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the unverified premise that multi-agent generation yields faithful story-level QAs at scale; no free parameters are described, standard ML assumptions about video encoders apply, and PlotTree is an invented hierarchical structure without external falsifiable evidence provided.

axioms (1)
  • domain assumption: Multi-agent collaboration with supervisor guidance and voting produces high-quality, unbiased question-answer pairs that capture complex storylines in long videos.
    Invoked when describing construction of StoryVideoQA from TV series and movies; no human validation or external check is mentioned.
invented entities (1)
  • PlotTree (no independent evidence)
    Purpose: Re-organize long-range video content into a hierarchical plot structure for efficient storyline reasoning.
    New agent proposed to address limitations of existing VideoQA methods on the new dataset.



Reference graph

Works this paper leans on

119 extracted references · 23 canonical work pages · 17 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: Timechat: A time-sensitive multimodal large language model for long video under- standing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323 (2024)

  2. [2]

    In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sat- tler, T., Varol, G

    Huang, D.-A., Liao, S., Radhakrishnan, S., Yin, H., Molchanov, P., Yu, Z., Kautz, J.: Lita: Language instructed temporal- localization assistant. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sat- tler, T., Varol, G. (eds.) Computer Vision – ECCV 2024, pp. 202–218. Springer, Cham (2025)

  3. [3]

    In: Proceedings of the 41st International Con- ference on Machine Learning, pp

    Qian, L., Li, J., Wu, Y., Ye, Y., Fei, H., Chua, T.-S., Zhuang, Y., Tang, S.: Momen- tor: advancing video large language model with fine-grained temporal reasoning. In: Proceedings of the 41st International Con- ference on Machine Learning, pp. 41340– 41356 (2024)

  4. [4]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

    Lee, M.J., Gong, D., Cho, M.: Video sum- marization with large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18981– 18991 (2025)

  5. [5]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Argaw, D.M., Yoon, S., Heilbron, F.C., Deil- amsalehy, H., Bui, T., Wang, Z., Dernon- court, F., Chung, J.S.: Scaling up video sum- marization pretraining with large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8332–8341 (2024)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    He, B., Wang, J., Qiu, J., Bui, T., Shrivas- tava, A., Wang, Z.: Align and attend: Multi- modal summarization with dual contrastive losses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14867–14878 (2023)

  7. [7]

    In: Proceedings of the 2025 International Conference on Multimedia Retrieval

    Wu, Z., Wang, X., Chang, H., Chen, H., Sun, L., Zhu, W.: Aligning large multimodal 23 model with sequential recommendation via content-behavior guidance. In: Proceedings of the 2025 International Conference on Multimedia Retrieval. ICMR ’25, pp. 1507–

  8. [8]

    Association for Computing Machinery, New York, NY, USA (2025)

  9. [9]

    In: MultiMedia Modeling, pp

    Gu, G., Wu, Z., He, J., Song, L., Wang, Z., Liang, C.: Talksee: Interactive video retrieval engine using large language model. In: MultiMedia Modeling, pp. 387–393. Springer, Cham (2024)

  10. [10]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp

    Galanopoulos, D., Goulas, A., Leven- takis, A., Patras, I., Mezaris, V.: An llm framework for long-form video retrieval and audio-visual question answering using qwen2/2.5. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 3739–3748 (2025)

  11. [11]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp

    Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-univi: Unified visual repre- sentation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 13700–13710 (2024)

  12. [12]

    In: Ku, L.-W., Martins, A., Srikumar, V

    Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-ChatGPT: Towards detailed video understanding via large vision and lan- guage models. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12585–12602. Association for Computational Linguistics, B...

  13. [13]

    Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: VideoChat: Chat-Centric Video Under- standing (2024)

  14. [14]

    Proceedings of the AAAI Conference on Artificial Intel- ligence33, 9127–9134 (2019)

    Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., Tao, D.: Activitynet-qa: A dataset for understanding complex web videos via question answering. Proceedings of the AAAI Conference on Artificial Intel- ligence33, 9127–9134 (2019)

  15. [15]

    In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Xiao, J., Shang, X., Yao, A., Chua, T.-S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9777–9786 (2021)

  16. [16]

    In: 2023 IEEE Interna- tional Conference on Multimedia and Expo (ICME), pp

    Guo, J., Liang, C., Wang, Z.: Who, what and where: Composite-semantic instance search for story videos. In: 2023 IEEE Interna- tional Conference on Multimedia and Expo (ICME), pp. 858–863 (2023). IEEE

  17. [17]

    IEEE Transactions on Image Processing34, 1412–1426 (2025)

    Guo, J., Lu, A., Wu, Z., Wang, Z., Liang, C.: Who, what, and where: Composite- semantics instance search for story videos. IEEE Transactions on Image Processing34, 1412–1426 (2025)

  18. [18]

    Proceedings of the AAAI Conference on Artificial Intelligence39(8), 8523–8531 (2025)

    Wu, Z., Li, R., Xu, Z., Wang, Z., Xiao, C., Liang, C.: Friendsqa: A new large- scale deep video understanding dataset with fine-grained topic categorization for story videos. Proceedings of the AAAI Conference on Artificial Intelligence39(8), 8523–8531 (2025)

  19. [19]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Tapaswi, M., Zhu, Y., Stiefelhagen, R., Tor- ralba, A., Urtasun, R., Fidler, S.: Movieqa: Understanding stories in movies through question-answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4631–4640 (2016)

  20. [20]

    In: Riloff, E., Chiang, D., Hock- enmaier, J., Tsujii, J

    Lei, J., Yu, L., Bansal, M., Berg, T.: TVQA: Localized, compositional video question answering. In: Riloff, E., Chiang, D., Hock- enmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, pp. 1369–1379. Association for Computational Linguistics, Brussels, Belgium (2018)

  21. [21]

    In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J

    Lei, J., Yu, L., Berg, T., Bansal, M.: TVQA+: Spatio-temporal grounding for video question answering. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguis- tics, pp. 8211–8225. Association for Compu- tational Linguistics, Online (2020) 24

  22. [22]

    In: Proceedings of the 2020 International Conference on Multimedia Retrieval

    Curtis, K., Awad, G., Rajput, S., Sobo- roff, I.: Hlvu: A new challenge to test deep understanding of movies the way humans do. In: Proceedings of the 2020 International Conference on Multimedia Retrieval. ICMR ’20, pp. 355–361. Association for Computing Machinery, New York, NY, USA (2020)

  23. [23]

    In: Proceedings of the AAAI Conference on Artificial Intelli- gence, vol

    Choi, S., On, K.-W., Heo, Y.-J., Seo, A., Jang, Y., Lee, M., Zhang, B.-T.: Dramaqa: Character-centered video story understand- ing with hierarchical qa. In: Proceedings of the AAAI Conference on Artificial Intelli- gence, vol. 35, pp. 1166–1174 (2021)

  24. [24]

    In: Proceedings of the 17th Conference of the European Chap- ter of the Association for Computational Linguistics, pp

    Fung, Y., Wang, H., Wang, T., Kebarighotbi, A., Bansal, M., Ji, H., Natarajan, P.: Deepmaven: Deep question answering on long-distance movie/tv show videos with multimedia knowledge extrac- tion and synthesis. In: Proceedings of the 17th Conference of the European Chap- ter of the Association for Computational Linguistics, pp. 3041–3051 (2023)

  25. [25]

    arXiv preprint arXiv:2405.08813 (2024)

    Rawal, R., Saifullah, K., Basri, R., Jacobs, D., Somepalli, G., Goldstein, T.: Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813 (2024)

  26. [26]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y.,et al.: Moviechat: From dense token to sparse memory for long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18221–18232 (2024)

  27. [27]

    International Journal of Com- puter Vision133(11), 7726–7747 (2025)

    Zhang, H., Dong, L., Liu, Y., Huang, Y., Wang, Y., Wang, L., Qiao, Y.: Lvbench: A benchmark for long-form video understand- ing with versatile multi-modal question answering. International Journal of Com- puter Vision133(11), 7726–7747 (2025)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pp

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., Chen, P., Li, Y., Lin, S., Zhao, S., Li, K., Xu, T., Zheng, X., Chen, E., Shan, C., He, R., Sun, X.: Video-mme: The first- ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF Conference on Computer Visio...

  29. [29]

    In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C

    Wu, H., Li, D., Chen, B., Li, J.: Longvideobench: A benchmark for long-context interleaved video-language understanding. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems, vol. 37, pp. 28828–28857 (2024)

  30. [30]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., Liu, Z.: Video-mmmu: Eval- uating knowledge acquisition from multi- discipline professional videos. arXiv preprint arXiv:2501.13826 (2025)

  31. [31]

    In: The Thirteenth International Conference on Learning Representations

    Chen, G., Liu, Y., Huang, Y., Pei, B., Xu, J., He, Y., Lu, T., Wang, Y., Wang, L.: Cg- bench: Clue-grounded question answering benchmark for long video understanding. In: The Thirteenth International Conference on Learning Representations

  32. [32]

    Vrbench: A benchmark for multi-step reasoning in long narrative videos,

    Yu, J., Wu, Y., Chu, M., Ren, Z., Huang, Z., Chu, P., Zhang, R., He, Y., Li, Q., Li, S., et al.: Vrbench: A benchmark for multi-step reasoning in long narrative videos. arXiv preprint arXiv:2506.10857 (2025)

  33. [33]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp

    Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., Dong, Y., Tang, J.: LVBench: An Extreme Long Video Under- standing Benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22958–22967 (2025)

  34. [34]

    Image (2020)

    He, X., Zhu, W.: Visual question answering from theory to application. Image (2020)

  35. [35]

    In: Goldberg, Y., Kozareva, Z., Zhang, 25 Y

    Zhong, Y., Ji, W., Xiao, J., Li, Y., Deng, W., Chua, T.-S.: Video question answer- ing: Datasets, algorithms and challenges. In: Goldberg, Y., Kozareva, Z., Zhang, 25 Y. (eds.) Proceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing, pp. 6439–6455. Asso- ciation for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022)

  36. [36]

    In: Ku, L.-W., Martins, A., Srikumar, V

    Nguyen, T., Bin, Y., Xiao, J., Qu, L., Li, Y., Wu, J.Z., Nguyen, C.-D., Ng, S.-K., Luu, A.T.: Video-language understanding: A sur- vey from model architecture, model train- ing, and data perspectives. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguis- tics: ACL 2024, pp. 3636–3657. Association for Comput...

  37. [37]

    International Journal of Computer Vision, 1–24 (2025)

    Xiao, J., Huang, N., Qin, H., Li, D., Li, Y., Zhu, F., Tao, Z., Yu, J., Lin, L., Chua, T.-S., et al.: Videoqa in the era of llms: An empirical study. International Journal of Computer Vision, 1–24 (2025)

  38. [38]

    In: Proceed- ings of the 25th ACM International Con- ference on Multimedia

    Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y.: Video question answering via gradually refined attention over appearance and motion. In: Proceed- ings of the 25th ACM International Con- ference on Multimedia. MM ’17, pp. 1645–

  39. [39]

    Association for Computing Machinery, New York, NY, USA (2017)

  40. [40]

    In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y

    Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question answering and retrieval. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018, pp. 487–503. Springer, Cham (2018)

  41. [41]

    In: Webber, B., Cohn, T., He, Y., Liu, Y

    Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., Liu, J.: HERO: Hierarchical encoder for Video+Language omni-representation pre- training. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pp. 2046–2065. Association for Computational Linguistics, Online (2020)

  42. [42]

    8731–8772 (2024)

    Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Temp- compass: Do video llms really understand videos? In: Findings of the Association for Computational Linguistics ACL 2024, pp. 8731–8772 (2024)

  43. [43]

    arXiv preprint arXiv:2406.11303 (2024)

    Li, Y., Chen, X., Hu, B., Wang, L., Shi, H., Zhang, M.: Videovista: A versatile bench- mark for video understanding and reason- ing. arXiv preprint arXiv:2406.11303 (2024)

  44. [44]

    Star: A benchmark for situated reasoning in real-world videos.arXiv preprint arXiv:2405.09711, 2024

    Wu, B., Yu, S., Chen, Z., Tenenbaum, J.B., Gan, C.: Star: A benchmark for situated rea- soning in real-world videos. arXiv preprint arXiv:2405.09711 (2024)

  45. [45]

    Advances in Neural Information Pro- cessing Systems36, 46212–46244 (2023)

    Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understand- ing. Advances in Neural Information Pro- cessing Systems36, 46212–46244 (2023)

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., Qiao, Y.: Mvbench: A comprehensive multi-modal video under- standing benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22195–22206 (2024)

  47. [47]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Ham- burger, J., Jiang, H., Liu, M., Liu, X.,et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012 (2022)

  48. [48]

    In: Proceedings of the 33rd ACM International Conference on Multi- media

    Xu, Z., Guo, J., Zhang, C., Wang, Z., Xiao, C., Liang, C.: Quantum interference-inspired who-what-where composite-semantics instance search for story videos. In: Proceedings of the 33rd ACM International Conference on Multi- media. MM ’25, pp. 4166–4174, New York, NY, USA (2025)

  49. [49]

    IEEE Transactions on Multimedia15(2), 401–414 (2013) 26

    Liang, C., Xu, C., Cheng, J., Min, W., Lu, H.: Script-to-movie: A computational frame- work for story movie composition. IEEE Transactions on Multimedia15(2), 401–414 (2013) 26

  50. [50]

    In: CVPR 2011, pp

    Liang, C., Xu, C., Cheng, J., Lu, H.: Tvparser: An automatic tv video parsing method. In: CVPR 2011, pp. 3377–3384 (2011)

  51. [51]

    In: Proceed- ings of the 30th ACM International Con- ference on Multimedia

    Curtis, K., Awad, G., Rajput, S., Soboroff, I.: The acm multimedia 2022 deep video understanding grand challenge. In: Proceed- ings of the 30th ACM International Con- ference on Multimedia. MM ’22, pp. 7075–

  52. [52]

    Association for Computing Machinery, New York, NY, USA (2022)

  53. [53]

    In: Proceedings of the 31st ACM International Conference on Multimedia

    Curtis, K., Awad, G., Godil, A., Soboroff, I.: The acm multimedia 2023 deep video under- standing grand challenge. In: Proceedings of the 31st ACM International Conference on Multimedia. MM ’23, pp. 9606–9609. Association for Computing Machinery, New York, NY, USA (2023)

  54. [54]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

    Kukleva, A., Tapaswi, M., Laptev, I.: Learn- ing interactions and relationships between movie characters. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9849– 9858 (2020)

  55. [55]

    In: Proceedings of the 31st ACM International Conference on Multimedia, pp

    Li, R., Guo, J., Li, M., Wu, Z., Liang, C.: A hierarchical deep video understand- ing method with shot-based instance search and large language model. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 9425–9429 (2023)

  56. [56]

    In: Proceed- ings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pp

    Li, D., Li, J., Li, H., Niebles, J.C., Hoi, S.C.: Align and prompt: Video-and-language pre- training with entity prompts. In: Proceed- ings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pp. 4953–4963 (2022)

  57. [57]

    In: Rogers, A., Boyd-Graber, J., Okazaki, N

    Lei, J., Berg, T., Bansal, M.: Revealing single frame bias for video-and-language learning. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 487–507. Association for Com- putational Linguistics, Toronto, Canada (2023)

  58. [58]

    In: International Conference on Machine Learning, pp

    Xu, H., Ye, Q., Yan, M., Shi, Y., Ye, J., Xu, Y., Li, C., Bi, B., Qian, Q., Wang, W.,et al.: mplug-2: A modularized multi- modal foundation model across text, image and video. In: International Conference on Machine Learning, pp. 38728–38748 (2023). PMLR

  59. [59]

    Choi, J., Lee, S., Chu, J., Choi, M., Kim, H.J.: vid-tldr: Training free token merging for light-weight video transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18771–18781 (2024)

  60. [60]

    Fu, T.-J., Li, L., Gan, Z., Lin, K., Wang, W.Y., Wang, L., Liu, Z.: An empirical study of end-to-end video-language transformers with masked visual modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22898–22909 (2023)

  61. [61]

    OpenAI: Chatgpt: A language model for conversational ai openai. Technical report (2023)

  62. [62]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  63. [63]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  64. [64]

    The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., Wang, L.: The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421 9(1) (2023)

  65. [65]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  66. [66]

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) (2023)

  67. [67]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  68. [68]

    Journal of Machine Learning Research 25(70), 1–53 (2024)

    Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E.H., Dean, J., Devlin, J., Roberts...

  69. [69]

    A Survey on Multimodal Large Language Models

    Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)

  70. [70]

    Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202...

  71. [71]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916 (2023)

  72. [72]

    Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)

  73. [73]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)

  74. [74]

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5971–5984. Association for Computational Linguistics, Miami, ...

  75. [75]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476 (2024)

  76. [76]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

  77. [77]

    Long Context Transfer from Language to Vision

    Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., Wang, Z., Tan, H., Li, C., Liu, Z.: Long context transfer from language to vision. arXiv preprint arXiv:2406.16852 (2024)

  78. [78]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., Liu, Z., Xu, H., J. Kim, H., Soran, B., Krishnamoorthi, R., Elhoseiny, M., Chandra, V.: Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434 (2024)

  79. [79]

    He, B., Li, H., Jang, Y.K., Jia, M., Cao, X., Shah, A., Shrivastava, A., Lim, S.-N.: Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13504–13514 (2024)

  80. [80]

    Man, Y., Huang, Y., Zhang, C., Li, B., Niu, W., Yin, M.: Adacm^2: On understanding extremely long-term video with adaptive cross-modality memory reduction. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8534–8544 (2025)

Showing first 80 references.