pith. machine review for the scientific record.

arxiv: 2604.05015 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: no theorem link

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Caifeng Shan, Chaoyou Fu, Chengwu Long, Haoyu Cao, Haozhi Yuan, Jinsen Su, Ran He, Xiaoxing Hu, Xiaoyao Xie, Xiawu Zheng, Xing Sun, Xue Yang, Xueying Li, Yi-Fan Zhang, Yongkang Xie, Yuhao Dong, Yunhang Shen, Yunsheng Wu, Ziwei Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords video understanding · benchmark · multimodal reasoning · temporal modeling · visual aggregation · evaluation strategy · video MLLM

The pith

Video-MME-v2 shows current models lag human experts because errors in visual aggregation and temporal modeling block higher-level reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Video-MME-v2 to replace saturated existing benchmarks that give inflated scores without measuring real video comprehension. It structures evaluation around a tri-level hierarchy that starts with multi-point visual aggregation, moves to temporal dynamics, and ends with complex multimodal reasoning, while using group-based non-linear scoring that withholds credit for isolated correct guesses and demands consistency across related questions. Experiments on this benchmark expose a wide gap between the top model and humans, with lower-level mistakes cascading upward and reasoning often depending on subtitles rather than pure visuals. Readers should care because the work isolates concrete bottlenecks that must be fixed before video AI can handle realistic tasks.

Core claim

Video-MME-v2 establishes that the leading model falls substantially short of human experts on comprehensive video understanding. Mistakes in visual information aggregation and temporal dynamics modeling propagate to limit performance at the level of complex multimodal reasoning. Thinking-based reasoning improves when subtitles are available but sometimes degrades in purely visual settings.

What carries the argument

Two pieces carry the argument: the progressive tri-level hierarchy, which incrementally raises complexity from multi-point visual information aggregation through temporal dynamics modeling to complex multimodal reasoning, and the group-based non-linear evaluation strategy, which enforces consistency across related queries and withholds credit for fragmented or guess-based answers.
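The paper's exact aggregation rule is not reproduced here, but a minimal sketch of what a group-based non-linear score could look like, assuming credit is granted only when every question in a related group is answered correctly (Video-MME-v2's actual rule also checks reasoning coherence and may weight groups differently), is below; it also illustrates why per-question accuracy can overstate capability when correctness is fragmented across a group.

```python
def group_nonlinear_score(answers, groups):
    """Hypothetical group-based non-linear scorer (illustrative only).

    answers: dict question_id -> bool, True if the model answered correctly
    groups:  dict group_id -> list of question_ids probing the same
             capability (e.g. related queries about one video)

    A group earns credit only if all of its questions are correct, so an
    isolated lucky guess contributes nothing; plain per-question accuracy
    is returned alongside for comparison.
    """
    group_credit = [all(answers[q] for q in qids) for qids in groups.values()]
    nonlinear = sum(group_credit) / len(group_credit)
    per_question = sum(answers.values()) / len(answers)
    return nonlinear, per_question


# Toy example: two groups of three related questions each.
answers = {"q1": True, "q2": True, "q3": True,    # coherent group
           "q4": True, "q5": False, "q6": True}   # fragmented correctness
groups = {"g1": ["q1", "q2", "q3"], "g2": ["q4", "q5", "q6"]}
print(group_nonlinear_score(answers, groups))  # (0.5, 0.833...): the gap per-question accuracy hides
```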

If this is right

  • Advancing video understanding requires targeted gains in visual information aggregation and temporal modeling before complex reasoning can improve.
  • Current models rely on textual cues such as subtitles to support thinking-based reasoning, with performance sometimes dropping when those cues are absent.
  • Standard per-question accuracy overestimates capabilities by crediting answers that lack coherence across related questions.
  • Future model development should prioritize architectures that maintain fidelity across visual details and time rather than compensating at the reasoning stage alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that explicitly supervise lower-level visual and temporal tasks may produce faster gains on high-level reasoning benchmarks than end-to-end reasoning training alone.
  • The same hierarchical structure could be applied to audio-video or long-horizon video tasks to diagnose whether similar error propagation occurs.
  • Architectures that preserve fine-grained visual information over extended sequences would be a direct test of whether the observed bottlenecks can be narrowed.

Load-bearing premise

The group-based non-linear evaluation and tri-level hierarchy accurately measure genuine video understanding without introducing their own biases or inconsistencies.

What would settle it

A result in which models reach human-level scores on the highest reasoning level while still failing the visual aggregation and temporal modeling levels under the same evaluation protocol would undermine the claimed hierarchical bottleneck.

Figures

Figures reproduced from arXiv: 2604.05015 by Caifeng Shan, Chaoyou Fu, Chengwu Long, Haoyu Cao, Haozhi Yuan, Jinsen Su, Ran He, Xiaoxing Hu, Xiaoyao Xie, Xiawu Zheng, Xing Sun, Xue Yang, Xueying Li, Yi-Fan Zhang, Yongkang Xie, Yuhao Dong, Yunhang Shen, Yunsheng Wu, Ziwei Liu.

Figure 1
Figure 1. Left: The three-level capability hierarchy of Video-MME-v2: distribution of capability dimensions across Level 1 (information retrieval and aggregation), Level 2 (temporal understanding), and Level 3 (complex reasoning). Right: Models are ranked by their group-based non-linear scores, while average accuracy is provided for reference only. Due to API limitations, Gemini models are tested by extracting and … view at source ↗
Figure 3
Figure 3. Video length and word count statistics. view at source ↗
Figure 5
Figure 5. Video view-count distribution. view at source ↗
Figure 6
Figure 6. Q1–Q4 Accuracy Trends and Stability. Trends under (a) capability consistency, (b) reasoning coherence, and (c) mean/variance statistics under capability consistency. view at source ↗
Figure 7
Figure 7. Effect of Thinking Mode on Video-MME-v2. Performance changes induced by enabling Thinking for instruction-tuned baseline models, evaluated under both without-subtitle and with-subtitle settings. view at source ↗
Figure 8
Figure 8. Capability Radar Across Video-MME-v2 Dimensions. For a wider range of models, please visit our project page, where you can select to view the radar chart performance of different models. view at source ↗
Figure 9
Figure 9. An example in Level 1: Visual Recognition. <Question 1>: In the ballroom scene, what color cloak was the assassin wearing at the beginning? <Options 1>: A. Brown. B. Grey. C. White. D. Red. E. Purple. F. Black. G. Green. H. Blue. <Question 2>: In the ballroom scene, what animal is the monster fighting the assassin based on? <Options 2>: A. Spider. B. Jaguar. C. Butterfly. D. Hornet. E. Falcon. F. Praying Ma… view at source ↗
Figure 10
Figure 10. An example in Level 2: Temporal Reasoning. <Question 1>: Why did the Suns’ player #3 leave the court when the score was 113:114? <Options 1>: A. Because he could not continue due to excessive physical exhaustion. B. Because he could not continue due to a rib injury from a collision. C. Because he could not continue due to an ankle injury. D. Because he was protesting the officiating by refusing to play. E.… view at source ↗
Figure 11
Figure 11. An example in Level 3: Entity Persistence Tracking. <Question 1>: Does the ball exist underneath any of the shells? <Options 1>: A. No. B. Yes. C. Cannot be determined. <Question 2>: Underneath which shell is the ball located at the end? <Options 2>: A. There is no ball under any shell. B. The third shell. C. The sixth shell. D. The second shell. E. The seventh shell. F. The fifth shell. G. The fourth shel… view at source ↗
read the original abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Besides, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between current best model Gemini-3-Pro and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Video-MME-v2, a new video understanding benchmark featuring a progressive tri-level hierarchy (multi-point visual aggregation, temporal dynamics modeling, and complex multimodal reasoning) and a group-based non-linear evaluation strategy that enforces consistency across related queries and coherence in multi-step reasoning. It is built via a controlled human annotation pipeline (12 annotators, 50 reviewers, 3300 human-hours, up to 5 QA rounds) and reports experiments showing a substantial performance gap between Gemini-3-Pro and human experts, with lower-level visual/temporal errors propagating to limit high-level reasoning, plus dependence on textual cues.

Significance. If the tri-level hierarchy and non-linear scoring are shown to be reliable, the benchmark could meaningfully advance evaluation of video MLLMs by exposing real limitations in visual-temporal integration and reasoning chains that saturated per-question accuracy metrics obscure. The scale and rigor of the human annotation pipeline is a clear strength that supports data quality claims.

major comments (2)
  1. [evaluation strategy (abstract and methods)] The abstract and evaluation strategy description claim that the group-based non-linear evaluator 'penalizes fragmented or guess-based correctness' and 'assigns credit only to answers supported by valid reasoning,' yet no quantitative validation is reported (e.g., inter-annotator agreement, correlation with standard per-question accuracy, or stability of model rankings when recomputed with conventional metrics). This is load-bearing for the central claim of hierarchical error propagation, as the observed bottlenecks could be produced by the scoring rules themselves.
  2. [experiments and results] The headline experimental result (Gemini-3-Pro vs. humans + propagation from visual/temporal errors to reasoning failures) is measured exclusively with the new tri-level question sets and non-linear evaluator. Without an ablation comparing these scores to ordinary accuracy on the identical data and questions, it is unclear whether the propagation pattern reflects model capabilities or an artifact of the metric design.
minor comments (2)
  1. [abstract] The abstract refers to 'Gemini-3-Pro' without specifying the exact model version or release date; this should be clarified for reproducibility.
  2. [introduction] The paper states the benchmark 'aims to serve as one of the most authoritative video benchmarks' but provides no direct comparison table against prior video benchmarks (e.g., Video-MME-v1, ActivityNet, or others) on question count, duration coverage, or annotation effort.
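A concrete form of the ablation requested in major comment 2, assuming per-model results are available under both scoring schemes (the model names and numbers below are illustrative placeholders, not values from the paper), could be as small as a rank-correlation check:

```python
from scipy.stats import spearmanr

# Illustrative placeholders: per-model scores under the two schemes.
# Real values would come from the benchmark's non-linear evaluator and a
# plain per-question accuracy pass over the identical questions.
nonlinear_scores = {"model_a": 0.41, "model_b": 0.37, "model_c": 0.52}
accuracy_scores = {"model_a": 0.68, "model_b": 0.71, "model_c": 0.74}

models = sorted(nonlinear_scores)
rho, p = spearmanr([nonlinear_scores[m] for m in models],
                   [accuracy_scores[m] for m in models])
print(f"rank correlation between scoring schemes: rho={rho:.2f}, p={p:.2f}")
# A high rho with a preserved ordering would argue the hierarchical
# bottleneck is not an artifact of the non-linear metric alone; a low rho
# would mean the two metrics are measuring different things.
```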

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments highlight important aspects of validating our proposed evaluation strategy and ensuring the robustness of our experimental claims. We address each major comment below and will incorporate revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [evaluation strategy (abstract and methods)] The abstract and evaluation strategy description claim that the group-based non-linear evaluator 'penalizes fragmented or guess-based correctness' and 'assigns credit only to answers supported by valid reasoning,' yet no quantitative validation is reported (e.g., inter-annotator agreement, correlation with standard per-question accuracy, or stability of model rankings when recomputed with conventional metrics). This is load-bearing for the central claim of hierarchical error propagation, as the observed bottlenecks could be produced by the scoring rules themselves.

    Authors: We agree that quantitative validation of the group-based non-linear evaluator is essential to substantiate our claims and rule out metric artifacts. The initial submission focused on describing the design and its intended properties but did not include explicit supporting analyses. In the revised manuscript, we will add: (1) inter-annotator agreement statistics for the group scoring decisions, drawing on the multi-reviewer quality assurance process (50 reviewers); (2) Pearson/Spearman correlations between the non-linear group scores and conventional per-question accuracy across models; and (3) a comparison of model rankings under both scoring schemes to assess stability. These additions will directly address whether the observed hierarchical propagation is robust or scoring-dependent. revision: yes

  2. Referee: [experiments and results] The headline experimental result (Gemini-3-Pro vs. humans + propagation from visual/temporal errors to reasoning failures) is measured exclusively with the new tri-level question sets and non-linear evaluator. Without an ablation comparing these scores to ordinary accuracy on the identical data and questions, it is unclear whether the propagation pattern reflects model capabilities or an artifact of the metric design.

    Authors: We acknowledge that presenting results solely under the new metric leaves open the possibility of metric-specific effects. To resolve this, the revised version will include a dedicated ablation section that recomputes all primary results—including the Gemini-3-Pro vs. human gap and the visual/temporal-to-reasoning error propagation—using both the group-based non-linear evaluator and standard per-question accuracy on the exact same question sets and videos. This will allow direct comparison of patterns and demonstrate that the bottlenecks are not an artifact of the evaluation design. revision: yes
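The agreement statistic proposed in response 1 is also straightforward to report, at least pairwise; a minimal sketch with invented reviewer labels (the 50-reviewer pipeline would call for a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha rather than this two-rater version) is:

```python
from sklearn.metrics import cohen_kappa_score

# Invented placeholder labels: two reviewers' accept/reject decisions on
# the same ten question groups during quality assurance.
reviewer_1 = ["accept", "accept", "reject", "accept", "reject",
              "accept", "accept", "reject", "accept", "accept"]
reviewer_2 = ["accept", "accept", "reject", "reject", "reject",
              "accept", "accept", "reject", "accept", "reject"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"pairwise inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
# Extending to all 50 reviewers means averaging pairwise kappas or using a
# multi-rater measure over the full label matrix.
```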

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical results are independent of self-referential derivations.

full rationale

The paper introduces Video-MME-v2 via explicit design choices (tri-level hierarchy from visual aggregation to multimodal reasoning, plus group-based non-linear scoring that penalizes inconsistency). These are presented as definitional construction steps, not derived from equations or prior fitted values. The headline claims (Gemini-3-Pro gap, error propagation) are direct empirical outputs from running the benchmark on models and humans; they do not reduce to the metric definition by construction, nor rely on self-citation chains for their validity. No fitted parameters are renamed as predictions, no uniqueness theorems are imported, and no ansatz is smuggled. The evaluation rules are stated upfront and applied externally, satisfying the self-contained benchmark criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a benchmark without mathematical derivations or fitted models, so the ledger contains only domain assumptions about what constitutes robust video understanding evaluation.

axioms (1)
  • domain assumption Human annotation with multiple reviewers produces reliable ground-truth labels for complex video reasoning tasks
    The construction relies on 12 annotators, 50 reviewers, and 5 rounds of QA as the basis for data quality.

pith-pipeline@v0.9.0 · 5669 in / 1291 out tokens · 42720 ms · 2026-05-10T18:44:18.568948+00:00 · methodology

discussion (0)


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.

  3. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.

  4. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 conditional novelty 7.0

    TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

  5. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  6. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  7. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  8. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

37 extracted references · 22 canonical work pages · cited by 5 Pith papers · 12 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  3. [3]

    Seed2.0 model card: Towards intelligence frontier for real-world complexity, February 2026

    ByteDance Seed Team. Seed2.0 model card: Towards intelligence frontier for real-world complexity, February 2026. Model Card

  4. [4]

    Insight-v++: Towards advanced long-chain visual reasoning with multimodal large language models.arXiv preprint arXiv:2603.18118, 2026

    Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, and Ziwei Liu. Insight-v++: Towards advanced long-chain visual reasoning with multimodal large language models.arXiv preprint arXiv:2603.18118, 2026

  5. [5]

    Demo-ICL: In-context learning for procedural video knowledge acquisition.arXiv preprint arXiv:2602.08439, 2026

    Yuhao Dong, Shulin Tian, Shuai Liu, Shuangrui Ding, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, and Ziwei Liu. Demo-icl: In-context learning for procedural video knowledge acquisition. arXiv preprint arXiv:2602.08439, 2026

  6. [6]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  7. [7]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  8. [8]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  9. [9]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957, 2025

  10. [10]

    Introducing gemini 3: our most intelligent model that helps you bring any idea to life

    Google DeepMind. Introducing gemini 3: our most intelligent model that helps you bring any idea to life. Google Blog, 2025

  11. [11]

    Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8450–8460, 2025

  12. [12]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025

  13. [13]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

  14. [14]

    Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency.arXiv preprint arXiv:2502.09621, 2025

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. arXiv preprint arXiv:2502.09621, 2025

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  16. [16]

    Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025

  17. [17]

    Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  18. [18]

    Videoreasonbench: Can mllms perform vision-centric complex video reasoning? arXiv preprint arXiv:2505.23359, 2025

    Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y Charles, Xinyu Zhou, and Xu Sun. Videoreasonbench: Can mllms perform vision-centric complex video reasoning? arXiv preprint arXiv:2505.23359, 2025

  19. [19]

    Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning. arXiv preprint arXiv:2504.07956, 2025

    Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning. arXiv preprint arXiv:2504.07956, 2025

  20. [20]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. Alibaba Cloud, Technical Report

  21. [21]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  22. [22]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  23. [23]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  24. [24]

    Qwen3.5-omni: Scaling up, toward native omni-modal agi, March 2026

    Qwen Team. Qwen3.5-omni: Scaling up, toward native omni-modal agi, March 2026

  25. [25]

    Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

    Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, and Ziwei Liu. Ego-r1: Chain-of-tool-thought for ultra-long egocentric video reasoning.arXiv preprint arXiv:2506.13654, 2025

  26. [26]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025

  27. [27]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  28. [28]

    Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

  29. [29]

    Mimo-vl technical report, 2025

    LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025

  30. [30]

    Xiaomi mimo-v2-omni: See, hear, act in the agentic era

    Xiaomi Corporation. Xiaomi mimo-v2-omni: See, hear, act in the agentic era. https://mimo.xiaomi.com/mimo-v2-omni, 2026

  31. [31]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  32. [32]

    Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025

  33. [33]

    Kwai keye-vl 1.5 technical report.arXiv preprint arXiv:2509.01563, 2025

    Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai keye-vl 1.5 technical report. arXiv preprint arXiv:2509.01563, 2025

  34. [34]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

  35. [35]

    Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding

    Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, and Ziwei Liu. Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20626–20636, 2025

  36. [36]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  37. [37]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025