Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Caifeng Shan; Chaoyou Fu; Chenyu Zhou; Enhong Chen; Ke Li; Lei Li; Mengdan Zhang; Peixian Chen; Ran He; Renrui Zhang

arxiv: 2405.21075 · v3 · submitted 2024-05-31 · 💻 cs.CV · cs.CL

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, Xing Sun

Pith reviewed 2026-05-11 01:54 UTC · model grok-4.3

keywords: video benchmark, multi-modal LLMs, video analysis, MLLM evaluation, long video understanding, multi-modal inputs, expert annotations, temporal dynamics

The pith

Video-MME introduces the first full-spectrum benchmark to test multi-modal LLMs on videos from seconds to an hour with audio and subtitles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Video-MME to fill the gap in evaluating MLLMs for sequential video data instead of static images. It assembles 900 manually selected and expert-annotated videos totaling 254 hours, spanning six visual domains, three duration categories, and multiple input types including frames, subtitles, and audio. Experiments on models such as Gemini 1.5 Pro and various open-source systems show that current leaders still fall short on long sequences and integrated modalities. A sympathetic reader would care because reliable video understanding is a necessary step toward broader AI capabilities that handle real-world temporal and multi-sensory information. The work therefore supplies a concrete tool for measuring and directing progress on these fronts.

Core claim

We introduce Video-MME as the first comprehensive benchmark for MLLMs in video analysis, built from 900 videos across 6 primary domains and 30 subfields, covering short, medium, and long durations up to one hour, and incorporating subtitles and audio in addition to visual frames. All 2,700 question-answer pairs were created through repeated expert manual review. Evaluations of commercial and open-source models indicate that Gemini 1.5 Pro leads but that substantial gaps remain in handling extended temporal contexts and multi-modal data.
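The headline figures above imply two simple averages that help size the benchmark. The following is a minimal Python sketch that derives them from the quoted totals only; the function names are illustrative, and the results are averages rather than per-video guarantees.

```python
# Totals quoted in the summary above.
NUM_VIDEOS = 900
NUM_QA_PAIRS = 2_700
TOTAL_HOURS = 254


def qa_pairs_per_video(num_qa: int, num_videos: int) -> float:
    """Average number of question-answer pairs per video."""
    return num_qa / num_videos


def mean_duration_minutes(total_hours: float, num_videos: int) -> float:
    """Average video length in minutes, from total hours and video count."""
    return total_hours * 60 / num_videos


if __name__ == "__main__":
    print(qa_pairs_per_video(NUM_QA_PAIRS, NUM_VIDEOS))               # 3.0
    print(round(mean_duration_minutes(TOTAL_HOURS, NUM_VIDEOS), 1))   # 16.9
```

An average of roughly 17 minutes per video sits well above the 11-second minimum, which is consistent with the summary's emphasis on medium and long durations.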

What carries the argument

The Video-MME benchmark itself, defined by its four design axes of domain diversity, temporal duration range, multi-modal input breadth, and expert manual annotation quality.

If this is right

Current MLLMs require targeted improvements to process videos longer than a few minutes while retaining context.
Incorporating audio and subtitle streams alongside frames remains only partially solved even in leading commercial models.
Open-source video MLLMs trail commercial systems such as Gemini 1.5 Pro and GPT-4 variants on the tested tasks.
The benchmark can be used to track whether future models close the identified gaps in long-sequence and multi-modal handling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider adoption of Video-MME could shift research focus from image-only pretraining toward video-native architectures.
If model rankings on this benchmark predict performance in downstream applications such as video search or summarization, then closing the long-video gap would yield immediate practical gains.
Extending the same annotation protocol to even longer or more noisy real-world videos could expose further limitations not visible in the current 254-hour set.

Load-bearing premise

The 900 chosen videos and the expert annotations are assumed to represent a broad, unbiased sample of real video analysis challenges without major selection or labeling bias.

What would settle it

A large-scale re-annotation of the same videos by an independent expert group or the addition of thousands of new videos from the same domains that produces substantially different model rankings or removes the observed performance gaps on long videos.
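One way to make "substantially different model rankings" concrete is a rank correlation between the original and re-annotated leaderboards. The sketch below computes Spearman's rho with the closed-form formula for untied ranks. The model names and accuracies are hypothetical placeholders, not results from the paper, and the no-ties assumption is ours.

```python
def spearman_rho(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Spearman rank correlation between two score tables (assumes no tied scores)."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both tables must cover the same models")
    models = sorted(scores_a)
    n = len(models)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")

    def ranks(scores: dict[str, float]) -> dict[str, int]:
        # Rank 1 is the highest-scoring model.
        ordered = sorted(models, key=lambda m: scores[m], reverse=True)
        return {m: i + 1 for i, m in enumerate(ordered)}

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in models)
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical accuracies under the original and an independent re-annotation.
    original = {"model_a": 0.75, "model_b": 0.60, "model_c": 0.55, "model_d": 0.40}
    reannotated = {"model_a": 0.70, "model_b": 0.52, "model_c": 0.58, "model_d": 0.41}
    print(round(spearman_rho(original, reannotated), 2))
```

A value near 1 would suggest the rankings survive re-annotation; a markedly lower value would point toward the labeling or selection bias raised in the premise above.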

Original abstract

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 254 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Video-MME is a timely new benchmark that widens video MLLM evaluation to long clips and audio/subtitles, but its claims rest on undocumented annotation quality.

The paper's core contribution is a new dataset of 900 videos totaling 254 hours, turned into 2,700 QA pairs. It spans six visual domains, video lengths from 11 seconds to one hour, and inputs that include subtitles and audio alongside frames. The authors run it on GPT-4 variants, Gemini 1.5 Pro, and several open-source models, showing Gemini ahead and every model struggling more on longer sequences and non-visual modalities. That combination of scope and initial results is the useful part; prior video benchmarks were narrower on duration or modality coverage. The work is straightforward empirical dataset construction with no fitted parameters or circular claims. What stands out is the explicit push to test multi-modal inputs beyond frames, which matches real deployment needs.

The soft spots are in the construction details. The abstract and available text describe repeated expert viewing and manual selection but give no inter-annotator agreement numbers, no breakdown of how questions were balanced for difficulty, and no explicit checks against selection bias in the 900 videos. Without those, the reliability of the 2,700 pairs as a stable yardstick is harder to judge. The model comparisons are descriptive rather than diagnostic, so they flag gaps without explaining why certain failures occur.

This paper is mainly for groups actively building or benchmarking video MLLMs who need a broader test set than existing short-clip options. It is worth sending to peer review because the field lacks this kind of coverage and the dataset itself can be stress-tested by others once released. Revisions would likely focus on adding the missing validation statistics rather than rethinking the overall design.

Referee Report

1 major / 2 minor

Summary. The paper introduces Video-MME, the first full-spectrum benchmark for multi-modal LLMs in video analysis. It comprises 900 manually selected videos (254 hours total) spanning 6 primary domains and 30 subfields, with durations from 11 seconds to 1 hour, incorporating video frames plus subtitles and audio. Expert annotators produced 2,700 QA pairs via repeated viewing. Experiments evaluate commercial models (GPT-4 series, Gemini 1.5 Pro) and open-source models (InternVL-Chat-V1.5, LLaVA-NeXT-Video), with Gemini 1.5 Pro performing best; the results highlight needs for better long-sequence and multi-modal handling.

Significance. If the construction details hold, Video-MME would be a significant contribution by providing the first benchmark that jointly stresses domain diversity, temporal scale (including hour-long videos), and multi-modal inputs for MLLM video evaluation. This directly addresses the image-centric focus of prior MLLM benchmarks and supplies concrete evidence of current model limitations, potentially serving as a standard testbed to guide future work on temporal reasoning and audio-subtitle integration.

major comments (1)

[§3] §3 (Benchmark Construction): The central claim of 'quality in annotations' for 'precise and reliable model assessment' rests on manual expert labeling of 900 videos and 2,700 QA pairs, yet the manuscript provides no quantitative inter-annotator agreement scores, number of annotators per video, or explicit controls for question difficulty and selection bias. This directly affects the soundness of all downstream model comparisons.
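For readers unfamiliar with the statistic requested here, a common inter-annotator agreement measure for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic illustration, not the paper's protocol; the answer labels "A" to "D" and the sample annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical answer keys from two annotators for eight questions.
    annotator_1 = ["A", "B", "C", "D", "A", "B", "C", "D"]
    annotator_2 = ["A", "B", "C", "D", "A", "B", "D", "C"]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

With more than two annotators per video, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.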

minor comments (2)

[Abstract] Abstract and §1: The 'first-ever full-spectrum' phrasing is scoped to the four listed features, but a brief explicit comparison table against the closest prior video benchmarks (e.g., those limited to short clips or single modalities) would strengthen the novelty claim without altering the core contribution.
[§4] §4 (Experiments): The reported model rankings would benefit from error bars or statistical significance tests on the accuracy differences, especially given the varying video lengths and modalities.
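As an illustration of the error bars suggested for §4, the sketch below computes a Wilson score interval for a single model's accuracy. The choice of the Wilson interval is ours rather than the referee's or the authors', and the count of 2,025 correct answers is a hypothetical placeholder; only the total of 2,700 questions comes from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical: a model answers 2,025 of the 2,700 questions correctly.
    low, high = wilson_interval(correct=2025, total=2700)
    print(f"accuracy 0.750, 95% interval [{low:.3f}, {high:.3f}]")
```

Intervals computed on the per-duration subsets would be wider than on the full set, since each subset contains fewer questions, which matters when comparing models on long videos alone.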

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the single major comment on annotation quality below.

Referee: [§3] §3 (Benchmark Construction): The central claim of 'quality in annotations' for 'precise and reliable model assessment' rests on manual expert labeling of 900 videos and 2,700 QA pairs, yet the manuscript provides no quantitative inter-annotator agreement scores, number of annotators per video, or explicit controls for question difficulty and selection bias. This directly affects the soundness of all downstream model comparisons.

Authors: We agree that the current version of the manuscript does not report quantitative inter-annotator agreement scores, the precise number of annotators per video, or explicit protocols for controlling question difficulty and selection bias. The annotation description emphasizes expert annotators who repeatedly viewed each video in full to generate the QA pairs, which was intended to ensure reliability. In the revised manuscript we will expand §3 with additional details on the annotation workflow, including the number of annotators involved, the guidelines provided to annotators for balancing difficulty and avoiding bias, and any available agreement or consensus metrics from the process. This will directly address the concern and strengthen the justification for the benchmark's use in model comparisons.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark introduction is self-contained empirical work

The paper introduces Video-MME as a new benchmark via manual video selection (900 videos, 254 hours) and expert annotation (2,700 QA pairs), then reports empirical evaluations of external MLLMs (GPT-4, Gemini, InternVL, LLaVA-NeXT-Video). No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The four distinguishing features (diversity, duration, modalities, quality) are descriptive of the construction process rather than derived quantities. Claims of 'first-ever full-spectrum' are scoped to the listed criteria and do not rely on self-citation chains or reductions to inputs. The work is self-contained against external model testing and stated annotation protocols, with no load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about what constitutes a high-quality benchmark rather than new mathematical constructs or fitted parameters.

axioms (1)

Domain assumption: Rigorous manual labeling by expert annotators produces precise and reliable question-answer pairs.
Invoked when describing the annotation process and quality feature.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 8.0

VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
cs.AI 2026-05 unverdicted novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV 2026-05 unverdicted novelty 7.0

FineBench is a new dense VQA benchmark for fine-grained human activity understanding in long videos, revealing weaknesses in open VLMs and showing that FineAgent improves them via localization and description modules.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV 2026-05 unverdicted novelty 7.0

FineBench is a new dense VQA benchmark for fine-grained human activity in long videos that exposes weaknesses in open VLMs and demonstrates gains from the proposed FineAgent modular framework.
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 7.0

VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
FCMBench-Video: Benchmarking Document Video Intelligence
cs.CV 2026-04 unverdicted novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
Topology-Aware Layer Pruning for Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

A topology-aware pruning framework models layer representation evolution in LVLMs via simplicial complexes and zigzag persistent homology to enable adaptive removal of layers while outperforming existing methods on mu...
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
cs.CV 2026-04 unverdicted novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
cs.CV 2026-04 unverdicted novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
cs.CV 2026-04 unverdicted novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
TrajTok: Learning Trajectory Tokens enables better Video Understanding
cs.CV 2026-02 unverdicted novelty 7.0

TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
cs.CL 2025-09 unverdicted novelty 7.0

ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
cs.CV 2025-06 conditional novelty 7.0

SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
cs.CV 2025-05 conditional novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
cs.CV 2025-04 unverdicted novelty 7.0

SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
Unified Reward Model for Multimodal Understanding and Generation
cs.CV 2025-03 unverdicted novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
cs.CV 2025-02 unverdicted novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
cs.CV 2024-11 unverdicted novelty 7.0

VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
LVBench: An Extreme Long Video Understanding Benchmark
cs.CV 2024-06 accept novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
cs.AI 2026-05 unverdicted novelty 6.0

ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
cs.CV 2026-05 unverdicted novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
An Efficient Streaming Video Understanding Framework with Agentic Control
cs.CV 2026-05 unverdicted novelty 6.0

R3-Streaming uses cascaded control, age-aware memory forgetting, and TB-GRPO reinforcement learning to reach SOTA scores on streaming video benchmarks while cutting visual token usage by 95-96%.
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
cs.CV 2026-05 unverdicted novelty 6.0

TRACE builds structured text timelines from videos via OCR and detection, then applies text-only LLM evidence localization before LVLM claim generation, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
Personal Visual Context Learning in Large Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
cs.MM 2026-05 unverdicted novelty 6.0

MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditi...
Video-ToC: Video Tree-of-Cue Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
cs.AI 2026-04 unverdicted novelty 6.0

Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque ...
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
cs.CV 2026-01 unverdicted novelty 6.0

HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
Streaming Video Instruction Tuning
cs.CV 2025-12 unverdicted novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
StreamingVLM: Real-Time Understanding for Infinite Video Streams
cs.CV 2025-10 unverdicted novelty 6.0

StreamingVLM enables stable real-time understanding of infinite video streams at up to 8 FPS using a streaming KV cache and aligned SFT on overlapped chunks, with a 66.18% win rate over GPT-4o mini on a new two-hour v...
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
cs.CV 2025-09 unverdicted novelty 6.0

CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
Training-Free Multimodal Large Language Model Orchestration
cs.CL 2025-08 unverdicted novelty 6.0

LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
cs.CV 2025-07 unverdicted novelty 6.0

ReGATE introduces a teacher-student adaptive token elision method that reduces training tokens to 38% while matching or exceeding baseline accuracy on multimodal benchmarks.
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning
cs.AI 2025-07 unverdicted novelty 6.0

ChipSeek is a hierarchical-reward reinforcement learning framework with Curriculum-Guided Dynamic Policy Optimization that integrates EDA simulator feedback to improve LLM-generated RTL code on both functional correct...
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
cs.CV 2025-05 unverdicted novelty 6.0

TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
cs.CV 2025-05 unverdicted novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
cs.CV 2025-05 unverdicted novelty 6.0

LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
Perception Encoder: The best visual embeddings are not at the output of the network
cs.CV 2025-04 unverdicted novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
cs.CV 2025-01 conditional novelty 6.0

Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
cs.CV 2025-01 unverdicted novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
cs.CV 2024-12 unverdicted novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 88 Pith papers

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS,

work page
[2]

Training-free long-context scaling of large language models

Chen An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. ArXiv preprint,

work page
[3]

Infinibench: A comprehensive benchmark for large multi-modal models in very long video understanding

Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. Infinibench: A comprehensive benchmark for large multi-modal models in very long video understanding. ArXiv preprint, 2024. 3

work page 2024
[4]

Openflamingo: An open-source framework for training large autoregressive vision-language models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models. ArXiv preprint...

work page 2023
[5]

Qwen-vl: A frontier large vision-language model with versatile abilities

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv preprint, 2023. 2, 6

work page 2023
[6]

Fuyu-8b: A multimodal architecture for ai agents

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sagnak Tasırlar. Fuyu-8b: A multimodal architecture for ai agents. URL: https://www.adept.ai/blog/fuyu-8b, 2023. 3

work page 2023
[7]

Vlp: A survey on vision-language pre-training

Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 2023. 3

work page 2023
[8]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. ArXiv preprint, 2024. 6

work page 2024
[9]

Towards multimodal video paragraph captioning models robust to missing modality

Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, and Lu Hou. Towards multimodal video paragraph captioning models robust to missing modality. ArXiv preprint, 2024. 8

work page 2024
[10]

Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering

Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering. ArXiv preprint, 2023. 3

work page 2023
[11]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. ArXiv preprint, 2024. 2, 6

work page 2024
[12]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 3

work page 2023
[13]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. ArXiv preprint, 2023. 3, 8

work page 2023
[14]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3

work page 2021
[15]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. ArXiv preprint, 2024. 3

work page 2024
[16]

Mme: A comprehensive evaluation benchmark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. ArXiv preprint, 2023. 1, 3

work page 2023
[17]

Vita: Towards open-source interactive omni multimodal llm

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, et al. Vita: Towards open-source interactive omni multimodal llm. ArXiv preprint, 2024. 6

work page 2024
[18]

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. ArXiv preprint, 2025. 1, 6

work page 2025
[19]

Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2024. 3

work page 2024
[20]

Lita: Language instructed temporal-localization assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. ArXiv preprint, 2024. 3

work page 2024
[21]

TGIF-QA: toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 3

work page 2017
[22]

Effectiveness assessment of recent large vision-language models

Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, and Fahad Shahbaz Khan. Effectiveness assessment of recent large vision-language models. Visual Intelligence, 2024. 3

work page 2024
[23]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. ArXiv preprint, 2023. 6

work page 2023
[24]

TVQA: Localized, compositional video question answering

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. TVQA: Localized, compositional video question answering. In EMNLP, 2018. 3

work page 2018
[25]

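Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 3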

work page 2023
[26]

Videochat: Chat-centric video understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ArXiv preprint, 2023. 3, 8

work page 2023
[27]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. ArXiv preprint, 2023. 1, 3, 6

work page 2023
[28]

Hero: Hierarchical encoder for video+language omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+language omni-representation pre-training. In EMNLP, 2020. 3

work page 2020
[29]

Value: A multi-task benchmark for video-and-language understanding evaluation

Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, et al. Value: A multi-task benchmark for video-and-language understanding evaluation. ArXiv preprint, 2021. 3

work page 2021
[30]

M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning

Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning. ArXiv preprint, 2023. 8

work page 2023
[31]

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. ArXiv preprint, 2024. 3

work page 2024
[32]

Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. ArXiv preprint, 2023. 1, 3, 8

work page 2023
[33]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. ArXiv preprint, 2023. 1, 6

work page 2023
[34]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. ArXiv preprint, 2023. 2, 6

work page 2023
[35]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ArXiv preprint, 2023. 3

work page 2023
[36]

World model on million-length video and language with ringattention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. ArXiv preprint, 2024. 8

work page 2024
[37]

Ring attention with blockwise transformers for near-infinite context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In ICLR, 2024. 8

work page 2024
[38]

St-llm: Large language models are effective temporal learners

Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. ArXiv preprint, 2024. 6

work page 2024
[39]

Best practices and lessons learned on synthetic data for language models

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, et al. Best practices and lessons learned on synthetic data for language models. ArXiv preprint, 2024. 8

work page 2024
[40]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? ArXiv preprint, 2023. 3

work page 2023
[41]

Tempcompass: Do video llms really understand videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? ArXiv preprint, 2024. 1, 3

work page 2024
[42]

The flan collection: Designing data and methods for effective instruction tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. ArXiv preprint,

work page
[43]

Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv preprint, 2023. 1, 3

work page 2023
[44]

Valley: Video assistant with large language model enhanced ability

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Ming-Hui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. ArXiv preprint, 2023. 3

work page 2023
[45]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2024. 1, 3, 5, 8

work page 2024
[46]

Cross-task generalization via natural language crowdsourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022. 8

work page 2022
[47]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. ArXiv preprint, 2023. 3

work page 2023
[48]

GPT-4V(ision) system card, 2023

OpenAI. GPT-4V(ision) system card, 2023. 2, 6

work page 2023
[49]

GPT-4o system card, 2024

OpenAI. GPT-4o system card, 2024. 2, 6

work page 2024
[50]

Momentor: Advancing video large language model with fine-grained temporal reasoning

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, and Siliang Tang. Momentor: Advancing video large language model with fine-grained temporal reasoning. ArXiv preprint, 2024. 3

work page 2024
[51]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 3

work page 2021
[52]

TESTA: Temporal-spatial token aggregation for long-form video-language understanding

Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. TESTA: Temporal-spatial token aggregation for long-form video-language understanding. In EMNLP, 2023. 8

work page 2023
[53]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. ArXiv preprint, 2023. 3, 8

work page 2023
[54]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tianbo Ye, Yang Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. ArXiv preprint, 2023. 3

work page 2023
[55]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, 2024. 1, 2, 6

work page 2024
[56]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ArXiv preprint, 2023. 3

work page 2023
[57]

Lvbench: An extreme long video understanding benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. ArXiv preprint, 2024. 3

work page 2024
[58]

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019. 3

work page 2019
[59]

Large-scale multi-modal pre-trained models: A comprehensive survey

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research,

work page
[60]

Hawkeye: Training video-text llms for grounding text in videos

Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos. ArXiv preprint, 2024. 3

work page 2024
[61]

Star: A benchmark for situated reasoning in real-world videos

Bo Wu and Shoubin Yu. Star: A benchmark for situated reasoning in real-world videos. In NeurIPS, 2024. 3

work page 2024
[62]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. ArXiv preprint, 2024. 3

work page 2024
[63]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021. 3

work page 2021
[64]

Video question answering via gradually refined attention over appearance and motion

D. Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017. 3

work page 2017
[65]

Vision-flan: Scaling visual instruction tuning

Zhiyang Xu, Trevor Ashby, Chao Feng, Rulin Shao, Ying Shen, Di Jin, Qifan Wang, and Lifu Huang. Vision-flan: Scaling visual instruction tuning. ArXiv preprint, 2023. 8

work page 2023
[66]

A survey on multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024. 1, 3

work page 2024
[67]

Mm-vet: Evaluating large multimodal models for integrated capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv preprint, 2023. 1, 3

work page 2023
[68]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019. 3

work page 2019
[69]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for...

work page
[70]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 3

work page 2023
[71]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv preprint, 2023. 3

work page 2023
[72]

Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? ArXiv preprint, 2024. 3, 4

work page 2024
[73]

Llava-next: A strong zero-shot video understanding model, 2024

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 6

work page 2024

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Supplementary Material
[74]

entire video frames + complete subtitles/audios (optional) + question with prompt

Detailed Experimental Settings Models. We conduct a comprehensive evaluation on four commercial models and nine representative open-source video-based multimodal large language models. To further demonstrate the adaptability of our benchmark to multi-image scenarios, we also include three widely utilized image-based MLLMs as part of the evaluation. The c...

work page
[75]

Additional Analysis How do MLLMs perform on the two highlighted cases in Figure 1? We conduct qualitative evaluation (using frames and subtitles) on the two cases in Figure 1. As analyzed in Section 3.2, these two cases comprehensively examine the model’s capabilities in OCR, attribute perception, object recognition, and long-range temporal reasoning,...

work page

[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS,

work page

[2] [2]

Training-free long- context scaling of large language models

Chen An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long- context scaling of large language models. ArXiv preprint,

work page

[3] [3]

In- finibench: A comprehensive benchmark for large multi- modal models in very long video understanding

Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. In- finibench: A comprehensive benchmark for large multi- modal models in very long video understanding. ArXiv preprint, 2024. 3

work page 2024

[4] [4]

Openflamingo: An open-source frame- work for training large autoregressive vision-language mod- els

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source frame- work for training large autoregressive vision-language mod- els. ArXiv preprint...

work page 2023

[5] [5]

Qwen-vl: A frontier large vision-language model with versatile abilities

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv preprint, 2023. 2, 6

work page 2023

[6] [6]

Fuyu-8b: A multimodal architecture for ai agents

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sagnak Tasırlar. Fuyu-8b: A multimodal architecture for ai agents. URL: https://www.adept.ai/blog/fuyu-8b, 2023. 3

work page 2023

[7] [7]

Vlp: A survey on vision-language pre-training

Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. Vlp: A survey on vision-language pre-training. Machine Intelligence Re- search, 2023. 3

work page 2023

[8] [8]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. ArXiv preprint, 2024. 6

work page 2024

[9] [9]

Towards multimodal video paragraph captioning models robust to missing modal- ity

Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, and Lu Hou. Towards multimodal video paragraph captioning models robust to missing modal- ity. ArXiv preprint, 2024. 8

work page 2024

[10] [10]

Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering

Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering. ArXiv preprint, 2023. 3

work page 2023

[11] [11]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. ArXiv preprint, 2024. 2, 6

work page 2024

[12] [12]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- hao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 3

work page 2023

[13] [13]

Instructblip: Towards general- purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. ArXiv preprint, 2023. 3, 8

work page 2023

[14] [14]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3

work page 2021

[15] [15]

Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. ArXiv preprint, 2024. 3

work page 2024

[16] [16]

Mme: A comprehensive evaluation benchmark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. ArXiv preprint, 2023. 1, 3

work page 2023

[17] [17]

Vita: Towards open-source interactive omni multimodal llm

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, et al. Vita: Towards open-source interactive omni multimodal llm. ArXiv preprint, 2024. 6

work page 2024

[18] [18]

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yun- hang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. ArXiv preprint, 2025. 1, 6

work page 2025

[19] [19]

Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% perfor- mance

Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% perfor- mance. Visual Intelligence, 2024. 3

work page 2024

[20] [20]

Lita: Language instructed temporal-localization assistant

De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. ArXiv preprint, 2024. 3

work page 2024

[21] [21]

TGIF-QA: toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: toward spatio-temporal reasoning in visual question answering. In CVPR, 2017. 3

work page 2017

[22] [22]

Ef- fectiveness assessment of recent large vision-language mod- els

Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, and Fahad Shahbaz Khan. Ef- fectiveness assessment of recent large vision-language mod- els. Visual Intelligence, 2024. 3

work page 2024

[23] [23]

Chat-univi: Unified visual representation em- powers large language models with image and video under- standing

Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video under- standing. ArXiv preprint, 2023. 6

work page 2023

[24] [24]

TVQA: Localized, compositional video question answering

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. TVQA: Localized, compositional video question answering. In EMNLP, 2018. 3

work page 2018

[25] [25]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 3

work page 2023

[26] [26]

Videochat: Chat-centric video understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wen- hai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. ArXiv preprint, 2023. 3, 8

work page 2023

[27] [27]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. ArXiv preprint, 2023. 1, 3, 6

work page 2023

[28] [28]

Hero: Hierarchical encoder for video+ lan- guage omni-representation pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+ lan- guage omni-representation pre-training. In EMNLP, 2020. 3

work page 2020

[29] [29]

Value: A multi-task benchmark for video-and-language understanding evaluation

Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, et al. Value: A multi-task benchmark for video-and-language understanding evaluation. ArXiv preprint, 2021. 3

work page 2021

[30] [30]

M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning

Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning. ArXiv preprint, 2023. 8

work page 2023

[31] [31]

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. ArXiv preprint, 2024. 3

work page 2024

[32] [32]

Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. ArXiv preprint, 2023. 1, 3, 8

work page 2023

[33] [33]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. ArXiv preprint, 2023. 1, 6

work page 2023

[34] [34]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. ArXiv preprint, 2023. 2, 6

work page 2023

[35] [35]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. ArXiv preprint, 2023. 3

work page 2023

[36] [36]

World model on million-length video and language with ringattention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. ArXiv preprint, 2024. 8

work page 2024

[37] [37]

Ring attention with blockwise transformers for near-infinite context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In ICLR, 2024. 8

work page 2024

[38] [38]

St-llm: Large language models are effective temporal learners

Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. ArXiv preprint, 2024. 6

work page 2024

[39] [39]

Best practices and lessons learned on synthetic data for language models

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, et al. Best practices and lessons learned on synthetic data for language models. ArXiv preprint, 2024. 8

work page 2024

[40] [40]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? ArXiv preprint, 2023. 3

work page 2023

[41] [41]

Tempcompass: Do video llms really understand videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? ArXiv preprint, 2024. 1, 3

work page 2024

[42] [42]

The flan collection: Designing data and methods for effective instruction tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. ArXiv preprint,

work page

[43] [43]

Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv preprint, 2023. 1, 3

work page 2023

[44] [44]

Valley: Video assistant with large language model enhanced ability

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Ming-Hui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. ArXiv preprint, 2023. 3

work page 2023

[45] [45]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2024. 1, 3, 5, 8

work page 2024

[46] [46]

Cross-task generalization via natural language crowdsourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022. 8

work page 2022

[47] [47]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. ArXiv preprint, 2023. 3

work page 2023

[48] [48]

GPT-4V(ision) system card

OpenAI. GPT-4V(ision) system card, 2023. 2, 6

work page 2023

[49] [49]

GPT-4o system card

OpenAI. GPT-4o system card, 2024. 2, 6

work page 2024

[50] [50]

Momentor: Advancing video large language model with fine-grained temporal reasoning

Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, and Siliang Tang. Momentor: Advancing video large language model with fine-grained temporal reasoning. ArXiv preprint, 2024. 3

work page 2024

[51] [51]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 3

work page 2021

[52] [52]

TESTA: Temporal-spatial token aggregation for long-form video-language understanding

Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. TESTA: Temporal-spatial token aggregation for long-form video-language understanding. In EMNLP, 2023. 8

work page 2023

[53] [53]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. ArXiv preprint, 2023. 3, 8

work page 2023

[54] [54]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tianbo Ye, Yang Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. ArXiv preprint, 2023. 3

work page 2023

[55] [55]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, 2024. 1, 2, 6

work page 2024

[56] [56]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ArXiv preprint, 2023. 3

work page 2023

[57] [57]

Lvbench: An extreme long video understanding benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. ArXiv preprint, 2024. 3

work page 2024

[58] [58]

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019. 3

work page 2019

[59] [59]

Large-scale multi-modal pre-trained models: A comprehensive survey

Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, and Wen Gao. Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research ,

work page

[60] [60]

Hawkeye: Training video-text llms for grounding text in videos

Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos. ArXiv preprint, 2024. 3

work page 2024

[61] [61]

Star: A benchmark for situated reasoning in real-world videos

Bo Wu and Shoubin Yu. Star: A benchmark for situated reasoning in real-world videos. In NeurIPS, 2024. 3

work page 2024

[62] [62]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. ArXiv preprint, 2024. 3

work page 2024

[63] [63]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021. 3

work page 2021

[64] [64]

Video question answering via gradually refined attention over appearance and motion

D. Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017. 3

work page 2017

[65] [65]

Vision-flan: Scaling visual instruction tuning

Zhiyang Xu, Trevor Ashby, Chao Feng, Rulin Shao, Ying Shen, Di Jin, Qifan Wang, and Lifu Huang. Vision-flan: Scaling visual instruction tuning. ArXiv preprint, 2023. 8

work page 2023

[66] [66]

A survey on multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024. 1, 3

work page 2024

[67] [67]

Mm-vet: Evaluating large multimodal models for integrated capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv preprint, 2023. 1, 3

work page 2023

[68] [68]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019. 3

work page 2019

[69] [69]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for...

work page

[70] [70]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 3

work page 2023

[71] [71]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv preprint, 2023. 3

work page 2023

[72] [72]

Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? ArXiv preprint, 2024. 3, 4

work page 2024

[73] [73]

Llava-next: A strong zero-shot video understanding model

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 6

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Supplementary Material

work page 2024

[74] [74]

entire video frames + complete subtitles/audios (optional) + question with prompt

Detailed Experimental Settings

Models. We conduct a comprehensive evaluation on four commercial models and nine representative open-source video-based multimodal large language models. To further demonstrate the adaptability of our benchmark to multi-image scenarios, we also include three widely utilized image-based MLLMs as part of the evaluation. The c...

work page

[75] [75]

Additional Analysis

How do MLLMs perform on the two highlighted cases in Figure 1? We conduct a qualitative evaluation (using frames and subtitles) on the two cases in Figure 1. As analyzed in Section 3.2, these two cases comprehensively examine the model’s capabilities in OCR, attribute perception, object recognition, and long-range temporal reasoning,...

work page