LVBench: An Extreme Long Video Understanding Benchmark

arxiv: 2406.08035 · v3 · pith:Y77WGRQVnew · submitted 2024-06-12 · 💻 cs.CV · cs.AI


Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang


Pith reviewed 2026-05-19 11:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords long video understanding · multimodal large language models · video benchmark · long-term memory · information extraction · AI evaluation · video comprehension


The pith

Current multimodal models underperform on long video understanding tasks spanning several hours.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent progress has improved multimodal large language models on short videos under one minute, yet real-world uses such as embodied decision-making, movie reviews, and sports commentary require handling videos that last hours. The paper introduces LVBench, a benchmark built from publicly sourced long videos and a range of tasks for comprehension and information extraction. Extensive evaluations on the benchmark show that existing models continue to fall short in demonstrating the needed long-term memory and extended comprehension. The work positions LVBench as a tool to drive creation of models that can meet these longer-duration demands.

Core claim

LVBench is a benchmark for extreme long video understanding that uses publicly sourced videos spanning several hours together with diverse tasks for long-term memory and extended comprehension, and evaluations on it establish that current multimodal models still underperform on these demanding tasks.

What carries the argument

LVBench, a dataset of publicly sourced long videos paired with tasks for comprehension and information extraction over multi-hour durations.

If this is right

Models must develop stronger long-term memory mechanisms to handle multi-hour content.
Improved scores on LVBench would directly support applications in embodied intelligence and detailed content analysis.
The benchmark supplies a standardized way to measure progress toward extended video comprehension.
Public release of data and code enables consistent tracking of model advances on long videos.
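
The standardized measurement described above can be pictured as per-skill accuracy over question records. The following is a minimal sketch, not the authors' evaluation code: the record layout (`skill`, `answer`, `prediction`) and the function name `per_skill_accuracy` are hypothetical, and it assumes predictions are scored by exact match against a reference answer. The skill abbreviations are the six the paper names (TG, Sum, Rea, ER, EU, KIR).

```python
from collections import defaultdict

# Skill abbreviations named in the paper; everything else here is illustrative.
SKILLS = ("TG", "Sum", "Rea", "ER", "EU", "KIR")


def per_skill_accuracy(records):
    """Return {skill: accuracy} for the skills that appear in `records`.

    Each record is a dict with hypothetical keys:
    'skill' (one of SKILLS), 'answer', and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        skill = rec["skill"]
        if skill not in SKILLS:
            raise ValueError(f"unknown skill tag: {skill!r}")
        total[skill] += 1
        # Assumption: exact-match scoring against the reference answer.
        if rec["prediction"] == rec["answer"]:
            correct[skill] += 1
    return {skill: correct[skill] / total[skill] for skill in total}


# Made-up records for illustration only; these are not benchmark data.
example = [
    {"skill": "TG", "answer": "B", "prediction": "B"},
    {"skill": "TG", "answer": "C", "prediction": "A"},
    {"skill": "KIR", "answer": "D", "prediction": "D"},
]
scores = per_skill_accuracy(example)
```

Reporting one number per skill, rather than a single pooled score, makes it easier to see whether a gain comes from retrieval-style questions or from reasoning over the full video.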

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar evaluation sets could be developed for other long-form inputs such as extended audio or document streams.
Tracking LVBench scores over time would reveal whether architectural changes close the gap between short and long video performance.
Success on the benchmark may serve as an indicator for readiness in live commentary or review generation systems.

Load-bearing premise

The chosen tasks and videos in LVBench accurately reflect the comprehension demands of real-world long video applications such as embodied decision-making and in-depth reviews.

What would settle it

A model scoring high on LVBench but failing in actual embodied decision-making on long video streams, or a practical system succeeding at hour-scale video tasks while scoring low on the benchmark.

Original abstract

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LVBench is a new benchmark for hour-long videos that shows current models still lag on long-memory tasks, with public data as a plus but thin details on how the tasks were built.


The main thing to know is that this paper brings out LVBench, a benchmark for long videos that go on for hours, and shows that today's models aren't great at the long-memory tasks they set up. It's new in targeting those extreme lengths and the associated comprehension demands, distinct from the short-clip focus in prior work. They've done a good job pulling together public videos and making tasks that touch on things like robot decision making over time or full movie reviews. Releasing the data and code is the right move for something like this, and the evaluations back up the claim that models still have room to improve.

The part that feels a bit thin is the lack of detail on building the tasks. It's not obvious how they made sure the questions force you to use the whole video instead of just bits of it. More on the annotation process and any checks for that would make the results more solid. But the overall argument doesn't have obvious holes.

This is for folks who care about video understanding beyond the one-minute mark. If you're testing models for real applications with long content, this gives you something new to try and could help guide future work. I'd say send it to referees. It's a practical addition that can help organize work in this area, and the community would benefit from the discussion.

Referee Report

1 major / 2 minor

Summary. The paper introduces LVBench, a benchmark for extreme long video understanding. It consists of publicly sourced videos spanning several hours along with diverse tasks targeting long-term memory and extended comprehension. Evaluations of current multimodal large language models reveal underperformance on these tasks relative to short-video settings, with the goal of spurring progress on real-world applications such as embodied decision-making and in-depth movie reviews. Data and code are released publicly.

Significance. If the central evaluation results hold, the benchmark is a useful addition to the field because it targets a clear gap between existing short-video datasets and the multi-hour comprehension demands of practical applications. The public release of data and code is a concrete strength that supports reproducibility and follow-on work.

major comments (1)

[Dataset construction] Dataset construction section: the manuscript provides only high-level descriptions of task design and annotation. Without explicit criteria showing that individual questions require information distributed across the full video length (rather than answerable from short local segments), it is difficult to confirm that the reported underperformance isolates long-range understanding deficits from general multimodal or short-range limitations.

minor comments (2)

[Abstract] The abstract states that models 'still underperform' but does not report any concrete accuracy numbers or comparison baselines; adding one or two key quantitative results would improve the summary.
[Experiments] Figure captions and axis labels in the evaluation plots should be checked for consistency with the text descriptions of the tasks.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback and the recommendation of minor revision. We address the major comment on dataset construction below.


Referee: [Dataset construction] Dataset construction section: the manuscript provides only high-level descriptions of task design and annotation. Without explicit criteria showing that individual questions require information distributed across the full video length (rather than answerable from short local segments), it is difficult to confirm that the reported underperformance isolates long-range understanding deficits from general multimodal or short-range limitations.

Authors: We appreciate this point and agree that greater specificity would strengthen the manuscript. The original submission emphasized high-level task categories and overall statistics to keep the focus on benchmark scale and model evaluations. In the revised version, we will expand the Dataset Construction section with explicit annotation criteria and concrete examples. These will include guidelines requiring questions to integrate information from temporally distant segments (e.g., linking an event in the first 10 minutes to its consequence after 2 hours, or tracking cumulative state changes across the full duration). We will also report inter-annotator agreement on whether questions could be answered from short clips alone. This addition will better isolate long-range deficits while preserving the paper's core claims.

Revision: yes
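
The inter-annotator agreement mentioned in the response could be quantified in several ways. As a hedged illustration, the sketch below uses Cohen's kappa for two annotators; the rebuttal does not name a statistic, so the choice of kappa, the function name `cohens_kappa`, and the example labels are all assumptions made here for illustration.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: True means "answerable from a short clip alone".
annotator_1 = [True, False, False, True, False, False]
annotator_2 = [True, False, True, True, False, False]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most questions fall into a single category, as would be the case if nearly all questions genuinely need long-range context.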

Circularity Check

0 steps flagged

No significant circularity


The paper introduces LVBench as a new benchmark constructed from publicly sourced videos with tasks targeting long-term memory and extended comprehension. No equations, fitted parameters, or self-referential derivations appear in the provided text; the central claim of model underperformance is demonstrated via evaluation on this independent dataset rather than reducing to prior results by construction. The work is self-contained against external benchmarks, with public data release aligning to standard practice for resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced; the work relies on standard domain assumptions about what constitutes long-video comprehension.

axioms (1)

domain assumption: Tasks designed for long video comprehension and information extraction validly measure extended memory capabilities required by real-world applications.
This premise underpins the claim that underperformance on LVBench indicates a general limitation in current models.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We define long videos as those having a minimum duration of 30 minutes … six core skills: Temporal Grounding (TG), Summarization (Sum), Reasoning (Rea), Entity Recognition (ER), Event Understanding (EU), Key Information Retrieval (KIR)."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
cs.MM 2026-04 unverdicted novelty 7.0

MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
TrajTok: Learning Trajectory Tokens enables better Video Understanding
cs.CV 2026-02 unverdicted novelty 7.0

TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
cs.CV 2025-06 conditional novelty 7.0

SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
cs.CL 2026-04 unverdicted novelty 6.0

MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
cs.CV 2024-12 unverdicted novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
Seed1.8 Model Card: Towards Generalized Real-World Agency
cs.AI 2026-03 unverdicted novelty 5.0

Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
cs.LG 2025-09 unverdicted novelty 5.0

An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
Training-Free Multimodal Large Language Model Orchestration
cs.CL 2025-08 unverdicted novelty 5.0

A training-free orchestration framework integrates off-the-shelf modality experts via an LLM controller, text-centric cross-modal memory, and unified interaction layer to enable multimodal input-output without joint training.
CogVLM2: Visual Language Models for Image and Video Understanding
cs.CV 2024-08 conditional novelty 5.0

CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
cs.CL 2025-07 unverdicted novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 20 Pith papers · 22 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 ,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

work page
[3]

The claude 3 model family: Opus, sonnet, haiku

Anthropic. The claude 3 model family: Opus, sonnet, haiku

work page
[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Glm: General language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021. 1, 5

work page arXiv 2021
[9]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever compre- hensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report. arXiv preprint arXiv:2505.07062, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023. 1

work page arXiv 2023
[12]

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Jun- hui Ji, Zhao Xue, et al. Cogvlm2: Visual language mod- els for image and video understanding. arXiv preprint arXiv:2408.16500, 2024. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 2758–2766, 2017. 3

work page 2017
[15]

Seed-bench-2: Benchmarking multimodal large language models, 2023

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models, 2023. 3

work page 2023
[16]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. arXiv preprint arXiv:2311.17005, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Hero: Hierarchical encoder for video+ language omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+ language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020. 3

work page arXiv 2005
[19]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023. 6

work page arXiv 2023
[20]

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268 , 2024. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Kangaroo: A powerful video-language model supporting long-context video input

Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xi- aoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542,

work page arXiv
[22]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. Advances in Neural In- formation Processing Systems, 36, 2024. 3

work page 2024
[25]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023. 1

work page 2023
[26]

OpenAI. Gpt-4o. 2024. 1, 6 9

work page 2024
[27]

OpenAI. Gpt-4.1. 2025. 6

work page 2025
[28]

Ziqi Pang and Yu-Xiong Wang. Mr. video:” mapreduce” is the principle for long video understanding. arXiv preprint arXiv:2504.16082, 2025. 6

work page arXiv 2025
[29]

Perception test: A diagnostic benchmark for multimodal video models

Viorica P ˘atr˘aucean, Lucas Smaira, Ankush Gupta, Adri`a Re- casens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osin- dero, Dima Da...

work page 2023
[30]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1

work page 2021
[31]

CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024. 3

work page arXiv 2024
[32]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Timechat: A time-sensitive multimodal large lan- guage model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. arXiv preprint arXiv:2312.02051, 2023. 6

work page arXiv 2023
[34]

Moviechat: From dense to- ken to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense to- ken to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 2, 3, 6

work page arXiv 2023
[35]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Movieqa: Understanding stories in movies through question- answering

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question- answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4631–4640,

work page
[37]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Adaretake: Adaptive redundancy reduction to perceive longer for video-language understanding

Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, and Liqiang Nie. Adaretake: Adaptive redundancy reduction to perceive longer for video-language understanding. arXiv preprint arXiv:2503.12559, 2025. 6

work page arXiv 2025
[40]

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context inter- leaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.
[42]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017.
[43]

PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.
[44]

mplug-owl3: Towards long image-sequence understanding in multi-modal large language models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024.
[45]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019.
[46]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[47]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025.
[48]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
[49]

Movqa: A benchmark of versatile question-answering for long-form movie understanding

Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, and Yu Qiao. Movqa: A benchmark of versatile question-answering for long-form movie understanding. arXiv preprint arXiv:2312.04817,
[50]

Llava-next: A strong zero-shot video understanding model

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024.

A. Datasheet

A.1. Motivation

• For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide ...

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

[3]

The claude 3 model family: Opus, sonnet, haiku

Anthropic. The claude 3 model family: Opus, sonnet, haiku

[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[5]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,

[6]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.

[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[8]

Glm: General language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.

[9]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.

[10]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.

[11]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023.

[12]

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.

[13]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

[14]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758–2766, 2017.

[15]

Seed-bench-2: Benchmarking multimodal large language models

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models, 2023.

[16]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

[17]

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.

[18]

Hero: Hierarchical encoder for video+language omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.

[19]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.

[20]

World Model on Million-Length Video And Language With Blockwise RingAttention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024.

[21]

Kangaroo: A powerful video-language model supporting long-context video input

Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542,

[22]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.

[23]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.

[24]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2024.

[25]

Gpt-4 technical report

OpenAI. Gpt-4 technical report, 2023.

[26]

OpenAI. Gpt-4o. 2024.

[27]

OpenAI. Gpt-4.1. 2025.

[28]

Ziqi Pang and Yu-Xiong Wang. Mr. video: "mapreduce" is the principle for long video understanding. arXiv preprint arXiv:2504.16082, 2025.

[29]

Perception test: A diagnostic benchmark for multimodal video models

Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Da...

[30]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[31]

CinePile: A Long Video Question Answering Dataset and Benchmark

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813, 2024.

[32]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[33]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. arXiv preprint arXiv:2312.02051, 2023.

[34]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.

[35]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
