
arxiv: 2406.08035 · v3 · pith:Y77WGRQV · submitted 2024-06-12 · 💻 cs.CV · cs.AI

LVBench: An Extreme Long Video Understanding Benchmark

Pith reviewed 2026-05-19 11:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long video understanding · multimodal large language models · video benchmark · long-term memory · information extraction · AI evaluation · video comprehension

The pith

Current multimodal models underperform on long video understanding tasks spanning several hours.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent progress has improved multimodal large language models on short videos under one minute, yet real-world uses such as embodied decision-making, movie reviews, and sports commentary require handling videos that last hours. The paper introduces LVBench, a benchmark built from publicly sourced long videos and a range of tasks for comprehension and information extraction. Extensive evaluations on the benchmark show that existing models continue to fall short in demonstrating the needed long-term memory and extended comprehension. The work positions LVBench as a tool to drive creation of models that can meet these longer-duration demands.

Core claim

LVBench is a benchmark for extreme long video understanding that uses publicly sourced videos spanning several hours together with diverse tasks for long-term memory and extended comprehension, and evaluations on it establish that current multimodal models still underperform on these demanding tasks.

What carries the argument

LVBench, a dataset of publicly sourced long videos paired with tasks for comprehension and information extraction over multi-hour durations.

If this is right

  • Models must develop stronger long-term memory mechanisms to handle multi-hour content.
  • Improved scores on LVBench would directly support applications in embodied intelligence and detailed content analysis.
  • The benchmark supplies a standardized way to measure progress toward extended video comprehension.
  • Public release of data and code enables consistent tracking of model advances on long videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar evaluation sets could be developed for other long-form inputs such as extended audio or document streams.
  • Tracking LVBench scores over time would reveal whether architectural changes close the gap between short and long video performance.
  • Success on the benchmark may serve as an indicator for readiness in live commentary or review generation systems.

Load-bearing premise

The chosen tasks and videos in LVBench accurately reflect the comprehension demands of real-world long video applications such as embodied decision-making and in-depth reviews.

What would settle it

A model scoring high on LVBench but failing in actual embodied decision-making on long video streams, or a practical system succeeding at hour-scale video tasks while scoring low on the benchmark.

Original abstract

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces LVBench, a benchmark for extreme long video understanding. It consists of publicly sourced videos spanning several hours along with diverse tasks targeting long-term memory and extended comprehension. Evaluations of current multimodal large language models reveal underperformance on these tasks relative to short-video settings, with the goal of spurring progress on real-world applications such as embodied decision-making and in-depth movie reviews. Data and code are released publicly.

Significance. If the central evaluation results hold, the benchmark is a useful addition to the field because it targets a clear gap between existing short-video datasets and the multi-hour comprehension demands of practical applications. The public release of data and code is a concrete strength that supports reproducibility and follow-on work.

major comments (1)
  1. [Dataset construction] Dataset construction section: the manuscript provides only high-level descriptions of task design and annotation. Without explicit criteria showing that individual questions require information distributed across the full video length (rather than answerable from short local segments), it is difficult to confirm that the reported underperformance isolates long-range understanding deficits from general multimodal or short-range limitations.
minor comments (2)
  1. [Abstract] The abstract states that models 'still underperform' but does not report any concrete accuracy numbers or comparison baselines; adding one or two key quantitative results would improve the summary.
  2. [Experiments] Figure captions and axis labels in the evaluation plots should be checked for consistency with the text descriptions of the tasks.
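
The major comment above asks for a way to show that questions depend on information spread across the video rather than on one short segment. One possible check is sketched below. It assumes each question carries annotated evidence timestamps, which the source does not confirm; the `Question` fields, the `evidence_span_ratio` helper, and the `min_ratio` threshold are all hypothetical names chosen for illustration, not part of LVBench.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    # All fields are hypothetical; LVBench's real annotation schema may differ.
    qid: str
    video_duration_s: float
    evidence_times_s: List[float]  # timestamps (seconds) an annotator marked as needed


def evidence_span_ratio(q: Question) -> float:
    """Fraction of the video covered between the earliest and latest evidence."""
    if not q.evidence_times_s or q.video_duration_s <= 0:
        return 0.0
    span = max(q.evidence_times_s) - min(q.evidence_times_s)
    return span / q.video_duration_s


def flag_local_questions(questions: List[Question], min_ratio: float = 0.1) -> List[str]:
    """Return ids of questions whose evidence sits inside a short local window."""
    return [q.qid for q in questions if evidence_span_ratio(q) < min_ratio]


if __name__ == "__main__":
    qs = [
        Question("q1", 7200.0, [300.0, 7000.0]),   # evidence spans most of a 2-hour video
        Question("q2", 7200.0, [1000.0, 1030.0]),  # evidence fits in a 30-second window
    ]
    print(flag_local_questions(qs))
```

A low ratio does not prove a question is easy, and a high ratio does not prove it needs long-range reasoning; the ratio is only a coarse filter that could help prioritize manual review.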

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback and the recommendation of minor revision. We address the major comment on dataset construction below.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the manuscript provides only high-level descriptions of task design and annotation. Without explicit criteria showing that individual questions require information distributed across the full video length (rather than answerable from short local segments), it is difficult to confirm that the reported underperformance isolates long-range understanding deficits from general multimodal or short-range limitations.

    Authors: We appreciate this point and agree that greater specificity would strengthen the manuscript. The original submission emphasized high-level task categories and overall statistics to keep the focus on benchmark scale and model evaluations. In the revised version, we will expand the Dataset Construction section with explicit annotation criteria and concrete examples. These will include guidelines requiring questions to integrate information from temporally distant segments (e.g., linking an event in the first 10 minutes to its consequence after 2 hours, or tracking cumulative state changes across the full duration). We will also report inter-annotator agreement on whether questions could be answered from short clips alone. This addition will better isolate long-range deficits while preserving the paper's core claims.

    Revision: yes
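
The simulated rebuttal mentions reporting inter-annotator agreement on whether a question can be answered from a short clip alone. A minimal sketch of one common agreement statistic, Cohen's kappa for two annotators, is given below. The choice of kappa, the binary label encoding (1 = answerable from a short clip, 0 = not), and the function name are assumptions made for illustration; the source does not say which agreement measure would be used.

```python
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        # Both annotators used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels: 1 = answerable from a short clip, 0 = needs long-range context.
    annotator_a = [1, 1, 0, 0]
    annotator_b = [1, 0, 0, 0]
    print(cohen_kappa(annotator_a, annotator_b))
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.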

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper introduces LVBench as a new benchmark constructed from publicly sourced videos with tasks targeting long-term memory and extended comprehension. No equations, fitted parameters, or self-referential derivations appear in the provided text; the central claim of model underperformance is demonstrated via evaluation on this independent dataset rather than reducing to prior results by construction. The work is self-contained against external benchmarks, with public data release aligning to standard practice for resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced; the work relies on standard domain assumptions about what constitutes long-video comprehension.

axioms (1)
  • Domain assumption: Tasks designed for long video comprehension and information extraction validly measure extended memory capabilities required by real-world applications.
    This premise underpins the claim that underperformance on LVBench indicates a general limitation in current models.

pith-pipeline@v0.9.0 · 5729 in / 912 out tokens · 35421 ms · 2026-05-19T11:49:59.064439+00:00 · methodology



Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.

  2. MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    cs.MM 2026-04 unverdicted novelty 7.0

    MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...

  3. TrajTok: Learning Trajectory Tokens enables better Video Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.

  4. VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

  5. SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

    cs.CV 2025-06 conditional novelty 7.0

    SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.

  6. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  7. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  8. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  9. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  10. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  11. VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

    cs.CV 2024-12 unverdicted novelty 6.0

    VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.

  12. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  13. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  14. MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    cs.LG 2025-09 unverdicted novelty 5.0

    An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.

  15. Training-Free Multimodal Large Language Model Orchestration

    cs.CL 2025-08 unverdicted novelty 5.0

    A training-free orchestration framework integrates off-the-shelf modality experts via an LLM controller, text-centric cross-modal memory, and unified interaction layer to enable multimodal input-output without joint training.

  16. CogVLM2: Visual Language Models for Image and Video Understanding

    cs.CV 2024-08 conditional novelty 5.0

    CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

  17. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  18. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    cs.CL 2025-07 unverdicted novelty 4.0

    Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.

  19. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  20. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
