pith. machine review for the scientific record.

arxiv: 2505.21374 · v1 · submitted 2025-05-27 · 💻 cs.CV

Recognition: 3 theorem links


Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · video reasoning benchmark · clue integration · complex reasoning · suspense films · MLLM evaluation · Sherlock Holmes · information integration

The pith

Multimodal models perceive video details but fail to integrate scattered clues, scoring at most 45 percent on a new Holmes-inspired benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-Holmes as a benchmark to evaluate if multimodal large language models can reason about videos in the complex, clue-integrating way that human experts do. It draws from 270 suspense short films to create 1,837 questions across seven tasks, each requiring models to actively find and connect multiple visual clues from different parts of the video rather than relying on single obvious cues. Existing benchmarks focus on basic perception, which does not test the full process of searching, integrating, and analyzing information to reach conclusions. Testing shows that even the best model, Gemini-2.5-Pro, only reaches 45 percent accuracy, and most others score below 40 percent, highlighting difficulties in information integration despite good visual perception. This evaluation matters because it identifies specific limitations that must be overcome for models to handle real-world video reasoning tasks effectively.
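To make the headline numbers concrete, here is a minimal sketch of how accuracy on a multiple-choice video benchmark of this shape could be scored. The record fields (video_id, question, options, answer, task) and the file name are hypothetical rather than the released data format, and `answer_fn` stands in for whatever MLLM is being evaluated.

```python
# Minimal scoring sketch for a multiple-choice video-reasoning benchmark.
# Record fields and the JSON filename are assumptions, not the released format.
import json
from collections import defaultdict

def evaluate(records, answer_fn):
    """answer_fn(video_id, question, options) -> predicted option letter."""
    correct, per_task = 0, defaultdict(lambda: [0, 0])
    for r in records:
        hit = answer_fn(r["video_id"], r["question"], r["options"]) == r["answer"]
        correct += hit
        per_task[r["task"]][0] += hit
        per_task[r["task"]][1] += 1
    overall = correct / len(records)
    by_task = {task: c / n for task, (c, n) in per_task.items()}
    return overall, by_task

if __name__ == "__main__":
    with open("video_holmes_questions.json") as f:   # hypothetical filename
        records = json.load(f)
    # A trivial always-"A" baseline; a real run would wrap the MLLM under test.
    overall, by_task = evaluate(records, answer_fn=lambda vid, q, opts: "A")
    print(f"overall accuracy: {overall:.1%}")
```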

Core claim

Video-Holmes consists of questions derived from manually annotated suspense short films: key events and causal relationships are identified first, and questions are then designed that require models to locate and link multiple relevant visual clues scattered across different video segments. Comprehensive evaluations of state-of-the-art MLLMs indicate that while they generally perform well on visual perception, they struggle substantially with integrating information and frequently overlook critical clues, as evidenced by the top accuracy of 45 percent for Gemini-2.5-Pro and most models scoring below 40 percent. The benchmark aims to serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and highlighting the ongoing challenges in the field.

What carries the argument

The Video-Holmes benchmark, built by annotating key events and causal relationships in suspense films and then crafting questions that demand active location and connection of multiple visual clues from various segments.
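One way to picture the construction step described here is a hypothetical annotation schema: timestamped clues, causal links between them, and questions that list which clues they depend on, with a check that those clues are actually spread across the video. None of these names come from the paper; they only illustrate the kind of structure the text implies.

```python
# Illustrative (not the authors') schema for clue-based question construction.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Clue:
    clue_id: str
    start_s: float            # when the clue appears in the video
    end_s: float
    description: str

@dataclass
class CausalLink:
    cause_clue_id: str
    effect_clue_id: str
    relation: str             # e.g. "motive", "enables", "reveals"

@dataclass
class HolmesQuestion:
    question: str
    options: List[str]
    answer: str
    task: str                 # one of the seven task types
    required_clues: List[str] = field(default_factory=list)

def spans_multiple_segments(q: HolmesQuestion, clues: Dict[str, Clue], min_gap_s: float = 30.0) -> bool:
    """Require that the clues a question depends on are temporally scattered."""
    times = sorted(clues[c].start_s for c in q.required_clues)
    return len(times) >= 2 and (times[-1] - times[0]) >= min_gap_s
```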

Load-bearing premise

The seven manually designed tasks from suspense films require and measure active search, integration, and analysis of multiple clues in a way that matches human expert reasoning processes.

What would settle it

A demonstration that models can reach high accuracy on the benchmark through single-cue answers or random guessing rather than multi-clue integration, or an error analysis showing that failures are predominantly perceptual rather than integrative, would undermine the central claim.
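A rough sketch of how that test could be run: compare accuracy with the full video, with a single randomly sampled frame, and with no visual input at all. If the three scores are close, high accuracy would not require multi-clue integration. The answer functions and the `n_frames` field are hypothetical stand-ins for the model and data under test, not anything released with the paper.

```python
# Hypothetical ablation harness: score the same question set under three
# input conditions. `answer_full`, `answer_one_frame`, and `answer_blind`
# wrap the model under test; `n_frames` is an assumed per-record frame count.
import random

def ablation_accuracies(records, answer_full, answer_one_frame, answer_blind):
    def acc(fn):
        return sum(fn(r) == r["answer"] for r in records) / len(records)
    return {
        "full_video": acc(answer_full),        # all frames available to the model
        "single_frame": acc(lambda r: answer_one_frame(r, random.randrange(r["n_frames"]))),
        "question_only": acc(answer_blind),    # language-prior / guessing baseline
    }
```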

read the original abstract

Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, which spans seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We aim that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released in https://github.com/TencentARC/Video-Holmes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Video-Holmes, a benchmark for complex video reasoning in MLLMs consisting of 1,837 questions derived from 270 manually annotated suspense short films across seven tasks. Each task is constructed by identifying key events and causal relationships, then designing questions that require active location and integration of multiple visual clues scattered across video segments. Evaluations of state-of-the-art MLLMs show strong visual perception but substantial difficulties with information integration, with Gemini-2.5-Pro achieving 45% accuracy and most models below 40%. The benchmark is released publicly to serve as a 'Holmes-test' for multimodal reasoning.

Significance. If validated, Video-Holmes would address a genuine gap in existing video benchmarks that focus primarily on perception and isolated cues rather than multi-clue integration. The open release of the benchmark on GitHub is a clear strength supporting reproducibility and future research. The headline result (top model at 45%) could usefully motivate work on better long-range temporal reasoning and clue aggregation if the tasks are shown to be solvable at substantially higher rates by humans.

major comments (3)
  1. [§5] §5 (Experiments): The claim that models 'encounter substantial difficulties with integrating information and often miss critical clues' rests on reported accuracies such as 45% for Gemini-2.5-Pro. However, no human performance baseline is provided on the same 1,837 questions and videos. Without this anchor, the numerical gap could reflect overall task hardness rather than a specific integration deficit in MLLMs.
  2. [§3] §3 (Benchmark Construction): The process of manually identifying key events, causal relationships, and writing multi-clue questions is described, but no inter-annotator agreement statistics, question validation procedures, or controls for annotation bias are reported. This information is load-bearing for the central assertion that the tasks accurately measure active search and integration comparable to human expert reasoning.
  3. [§4] §4 (Task Design): The seven tasks are presented as requiring models to connect clues across segments, yet there is limited evidence or analysis showing that the questions cannot be solved from isolated visual cues or simpler perception alone. This distinction is essential to differentiate Video-Holmes from prior benchmarks and to support the integration-difficulty interpretation.
minor comments (2)
  1. [Abstract] Abstract: The statement 'most models scoring below 40%' would benefit from a specific list or reference to the corresponding table/figure for immediate clarity.
  2. [Tables/Figures] Figure or Table captions (e.g., performance tables): Adding error bars or per-task breakdowns would improve interpretability of the integration-difficulty claim.
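On the second minor comment, a minimal sketch of the requested per-task breakdown with percentile-bootstrap confidence intervals, assuming evaluation results are available as (task, is_correct) pairs; this illustrates the kind of uncertainty estimate being asked for rather than reproducing the authors' analysis.

```python
# Per-task accuracy with 95% percentile-bootstrap confidence intervals.
# `results` is a hypothetical list of (task_name, is_correct) pairs from an evaluation run.
import random
from collections import defaultdict

def bootstrap_ci(flags, n_boot=2000, alpha=0.05):
    """Mean of 0/1 flags plus a percentile bootstrap interval."""
    means = sorted(
        sum(random.choice(flags) for _ in flags) / len(flags) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(flags) / len(flags), lo, hi

def per_task_cis(results):
    by_task = defaultdict(list)
    for task, correct in results:
        by_task[task].append(int(correct))
    return {task: bootstrap_ci(flags) for task, flags in by_task.items()}
```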

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, agreeing where revisions are warranted to improve the manuscript's rigor and clarity.

read point-by-point responses
  1. Referee: [§5] The claim that models 'encounter substantial difficulties with integrating information and often miss critical clues' rests on reported accuracies such as 45% for Gemini-2.5-Pro. However, no human performance baseline is provided on the same 1,837 questions and videos. Without this anchor, the numerical gap could reflect overall task hardness rather than a specific integration deficit in MLLMs.

    Authors: We agree that a human performance baseline would strengthen the interpretation of model results by providing a direct comparison. Although the questions were constructed from human-identified key events and causal relationships in suspense films (which attentive viewers can solve through integration), we did not report empirical human accuracy on the full set. In the revised manuscript, we will add human evaluation results on a representative subset of questions to demonstrate substantially higher human performance and support the claim of specific integration challenges for MLLMs. revision: yes

  2. Referee: [§3] The process of manually identifying key events, causal relationships, and writing multi-clue questions is described, but no inter-annotator agreement statistics, question validation procedures, or controls for annotation bias are reported. This information is load-bearing for the central assertion that the tasks accurately measure active search and integration comparable to human expert reasoning.

    Authors: We appreciate the emphasis on annotation quality and transparency. The benchmark was built through manual identification of events and relationships across 270 films, followed by question design requiring multi-segment clue integration. To address this, we will expand the revised manuscript with details on the annotation protocol, number of annotators, inter-annotator agreement statistics, validation procedures, and steps taken to mitigate bias (a minimal sketch of such an agreement computation follows these responses). revision: yes

  3. Referee: [§4] The seven tasks are presented as requiring models to connect clues across segments, yet there is limited evidence or analysis showing that the questions cannot be solved from isolated visual cues or simpler perception alone. This distinction is essential to differentiate Video-Holmes from prior benchmarks and to support the integration-difficulty interpretation.

    Authors: We agree that explicit evidence for the multi-clue integration requirement is necessary to differentiate from perception-focused benchmarks. Tasks were designed around causal relationships spanning segments, with questions requiring active location and connection of scattered clues. We will incorporate additional analysis in the revision, including qualitative examples where isolated cues are insufficient and supporting evidence that simpler perception alone does not suffice for correct answers. revision: yes
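On the agreement statistics promised in the second response, a minimal sketch of Cohen's kappa between two annotators who label the same items (for example, the gold answer option or the causal relation for each question). The label lists are invented for illustration; nothing here comes from the paper's annotation protocol.

```python
# Cohen's kappa for two annotators assigning categorical labels to the same items.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Invented example: two annotators labelling ten questions' answer keys.
    a = ["A", "B", "B", "C", "A", "D", "B", "A", "C", "C"]
    b = ["A", "B", "C", "C", "A", "D", "B", "A", "C", "B"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.72 for this example
```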

Circularity Check

0 steps flagged

No circularity: benchmark construction and model evaluation are independent of internal fits or self-referential definitions

full rationale

The paper presents Video-Holmes as an externally constructed benchmark of 1,837 manually annotated questions from 270 suspense films across seven tasks. Model accuracies (e.g., Gemini-2.5-Pro at 45%) are direct empirical measurements on this fixed test set; they are not obtained by fitting parameters to a subset and then relabeling the output as a prediction, nor by any self-definitional loop in which the claimed integration deficit is presupposed in the task design. No equations, uniqueness theorems, or ansatzes are invoked. The central claim rests on the external comparison of MLLM outputs to the benchmark questions rather than on any reduction to quantities defined inside the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the constructed questions measure human-like complex reasoning through clue integration; this is a domain assumption rather than a derived quantity.

axioms (1)
  • domain assumption Complex real-world video reasoning requires actively searching for, integrating, and analyzing multiple clues scattered across segments rather than relying on explicit prompts or isolated cues.
    Invoked to justify why existing benchmarks are insufficient and to motivate the seven task designs.

pith-pipeline@v0.9.0 · 5607 in / 1162 out tokens · 62261 ms · 2026-05-17T05:36:20.043648+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  2. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  3. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.

  4. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  5. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

  6. Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

    cs.CV 2026-04 unverdicted novelty 7.0

    MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.

  7. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  8. Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

  9. MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    cs.MM 2026-05 unverdicted novelty 6.0

    MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

  10. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  11. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  12. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  13. Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

  14. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  15. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  16. Boosting Reasoning in Large Multimodal Models via Activation Replay

    cs.CV 2025-11 unverdicted novelty 6.0

    Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.

  17. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  18. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  19. OneThinker: All-in-one Reasoning Model for Image and Video

    cs.CV 2025-12 unverdicted novelty 5.0

    OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

  20. MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

    cs.CV 2025-11 unverdicted novelty 5.0

    MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.

  21. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 19 Pith papers · 21 internal anchors

  1. [1]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 2

  2. [2]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 7, 9

  4. [4]

    Introducing openai o1

    OpenAI. Introducing openai o1. 2024. 2, 3

  5. [5]

    Openai o3

    OpenAI. Openai o3. 2025. 2, 9

  6. [6]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025. 2, 3, 6

  7. [7]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025. 2, 3, 6

  8. [8]

    Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench-r1

    Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, and Xihui Liu. Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench-r1. arXiv preprint arXiv:2503.24376, 2025. 2, 3, 6

  9. [9]

    Gemini-2.0-flash-thinking, 2024

    Google. Gemini-2.0-flash-thinking, 2024. 2, 6

  10. [10]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024. 2, 3

  11. [11]

    Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos

    Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, et al. Mmworld: Towards multi-discipline multi-faceted world model evaluation in videos. arXiv preprint arXiv:2406.08407, 2024. 2, 3

  12. [12]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, et al. Mmvu: Measuring expert-level multi-discipline video understanding. arXiv preprint arXiv:2501.12380, 2025. 2, 3

  13. [13]

    Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning

    Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning. arXiv preprint arXiv:2504.07956, 2025. 2, 3

  14. [14]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025. 2, 3

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 22195–22206, 2024. 2, 3

  16. [16]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024. 2, 3

  17. [17]

    V-star: Benchmarking video-llms on video spatio-temporal reasoning

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning. arXiv preprint arXiv:2503.11495, 2025.

  18. [18]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  19. [19]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 3

  20. [20]

    Minigpt-5: Interleaved vision-and-language generation via generative vokens

    Kaizhi Zheng, Xuehai He, and Xin Eric Wang. Minigpt-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239, 2023. 3

  21. [21]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 3

  22. [22]

    Visdiahalbench: A visual dialogue benchmark for diagnosing hallucination in large vision-language models

    Qingxing Cao, Junhao Cheng, Xiaodan Liang, and Liang Lin. Visdiahalbench: A visual dialogue benchmark for diagnosing hallucination in large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12161–12176, 2024. 3

  23. [23]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

  24. [24]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 3

  25. [25]

    CogVLM2: Visual Language Models for Image and Video Understanding

    Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024. 3

  26. [26]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 3

  27. [27]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 3, 5

  28. [28]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 3

  29. [29]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 3

  30. [30]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.

  31. [31]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025. 3

  32. [32]

    Improved visual-spatial reasoning via r1-zero-like training

    Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. Improved visual-spatial reasoning via r1-zero-like training. arXiv preprint arXiv:2504.00883, 2025. 3

  33. [33]

    OpenVLThinker: An early exploration to complex vision-language reasoning via iterative self-improvement

    Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Openvlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv preprint arXiv:2503.17352, 2025. 3

  34. [34]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025. 3

  35. [35]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024. 3

  36. [36]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025. 3

  37. [37]

    Tinyllava-video-r1: Towards smaller lmms for video reasoning

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025. 3

  38. [38]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, and Qin Jin. Timezero: Temporal video grounding with reasoning-guided lvlm. arXiv preprint arXiv:2503.13377, 2025. 3

  39. [39]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  40. [40]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017. 3

  41. [41]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127–9134, 2019.

  42. [42]

    Next-qa: Next phase of question- answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021. 3

  43. [43]

    Mmbench: Benchmarking end-to-end multi-modal dnns and understanding their hardware-software implications

    Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li, Tianhao Huang, Xiaozhi Zhu, Mo Niu, Lingyu Sun, Peng Tang, Tongqiao Xu, et al. Mmbench: Benchmarking end-to-end multi-modal dnns and understanding their hardware-software implications. In 2023 IEEE International Symposium on Workload Characterization (IISWC), pages 154–166. IEEE, 2023. 3

  44. [44]

    Longvideobench: A benchmark for long- context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long- context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 3

  45. [45]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

  46. [46]

    Introducing gemini 2.0: our new ai model for the agentic era, 2024

    Sundar Pichai, D Hassabis, and K Kavukcuoglu. Introducing gemini 2.0: our new ai model for the agentic era, 2024. 6

  47. [47]

    Gemini-2.0-pro, 2025

    Google. Gemini-2.0-pro, 2025. 6

  48. [48]

    Gemini-2.5-pro, 2025

    Google. Gemini-2.5-pro, 2025. 6

  49. [49]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024. 6

  50. [50]

    o4-mini, 2025

    OpenAI. o4-mini, 2025. 6

  51. [51]

    The claude 3 model family: Opus, sonnet, haiku, 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. 6