arxiv: 2501.13826 · v1 · submitted 2025-01-23 · 💻 cs.CV · cs.CL

Recognition: 1 theorem link

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Bo Li, Fanyi Pu, Kairui Hu, Penghao Wu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Ziwei Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 00:27 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords Video-MMMULarge Multimodal Modelsknowledge acquisitionvideo benchmarkcognitive stagesperception comprehension adaptationΔknowledge metricmultidisciplinary evaluation

0 comments

The pith

Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Video-MMMU, a benchmark with 300 professional videos and 900 questions across six disciplines. It evaluates large multimodal models on three stages of knowledge acquisition: perceiving details, comprehending concepts, and adapting knowledge to new problems. Results indicate that model performance drops as demands increase and remains far below human levels. This highlights the need for improved methods to help models learn from videos. A new metric called Δknowledge measures how much performance improves after viewing the video.

Core claim

Video-MMMU evaluates LMMs' knowledge acquisition from videos through stage-aligned questions on perception, comprehension, and adaptation. Evaluations reveal a steep performance decline with increasing cognitive demands and a significant gap compared to human performance, measured via the Δknowledge metric that quantifies improvement after video exposure.

What carries the argument

The Video-MMMU benchmark consisting of expert-level videos and human-annotated questions aligned to cognitive stages, along with the Δknowledge metric for quantifying performance gains.

Load-bearing premise

The 300 videos and 900 questions accurately represent unbiased examples of the three cognitive stages without annotation or selection biases skewing the measured performance gaps.

What would settle it

A new LMM achieving human-level scores on adaptation questions without a steep decline from perception and comprehension stages would falsify the claim of inherent limitations.

read the original abstract

Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, {\Delta}knowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Video-MMMU gives a practical new benchmark for video knowledge acquisition with a delta metric, but the reported performance drop across stages rests on unvalidated question labels.

read the letter

The main thing to know is that this paper builds a benchmark of 300 professional videos and 900 questions split into Perception, Comprehension, and Adaptation stages, then measures how much LMMs improve after watching. The delta knowledge score is a straightforward before-after difference that tracks acquisition rather than just final accuracy. That setup is new enough to matter for people testing models on educational or training videos, where most existing benchmarks stay at surface-level description or retrieval tasks. The human-model gap they show is large and consistent with what we see in other multimodal work, so the direction of the result feels plausible on its face. They also cover six disciplines, which adds some breadth over single-domain video sets. The soft spot is exactly the one the stress-test flags. The abstract and description give no numbers on inter-annotator agreement, no psychologist review, and no checks that the stage labels track genuine cognitive demand instead of surface features like question length or video segment choice. Without that, the monotonic drop in model scores could be driven by how the questions were written rather than by real differences in what the models can do. That does not kill the paper, but it does mean the headline claim needs tighter evidence before it can be taken as settled. This is the sort of work that belongs in a reading group focused on multimodal evaluation or video understanding. People building or testing LMMs for knowledge-heavy tasks will want the dataset and the metric even if they end up re-annotating the stages themselves. It is worth sending to peer review because the data collection is substantial and the gap it targets is real; referees can push on the annotation protocol and statistical controls without needing to reject the core idea.

Referee Report

2 major / 2 minor

Summary. The paper introduces Video-MMMU, a benchmark consisting of 300 expert-level videos and 900 human-annotated questions spanning six disciplines. It evaluates large multimodal models (LMMs) on knowledge acquisition from videos using stage-aligned QA pairs corresponding to three cognitive stages (Perception, Comprehension, Adaptation) and introduces a Δknowledge metric to quantify performance improvement after video viewing. The central empirical claims are a monotonic decline in LMM accuracy as cognitive demand increases across stages and a substantial gap relative to human performance.

Significance. If the stage labels are shown to be reliable, the benchmark would provide a useful diagnostic for LMM limitations in progressing from perception to adaptation on professional video content, complementing existing video QA datasets by explicitly targeting knowledge-acquisition trajectories rather than isolated retrieval or reasoning.

major comments (2)

[Benchmark Construction / Question Annotation] The human annotation procedure for assigning questions to Perception, Comprehension, and Adaptation stages (described in the benchmark construction section) reports no inter-annotator agreement statistics (Fleiss’ kappa, percentage agreement, or expert validation). Because the headline result is the steep performance drop across these stages, the absence of reliability metrics leaves open the possibility that the observed gradient arises from systematic differences in question phrasing, length, or video-segment selection rather than genuine differences in cognitive demand.
[Evaluation Metrics] The Δknowledge metric is introduced as a before-after performance difference, yet the paper provides no details on how video difficulty, length, or discipline-specific variance are controlled when computing or comparing this delta across models. Without such controls, the reported human-model gap and within-model stage declines cannot be unambiguously attributed to knowledge-acquisition deficits.

minor comments (2)

[Abstract / Dataset Statistics] The abstract and evaluation sections should explicitly state the number of videos and questions per discipline to allow readers to assess balance.
[Results Figures] Figure captions for performance plots should include error bars or confidence intervals and clarify whether the plotted accuracies are macro-averaged across disciplines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects of benchmark reliability and metric interpretability. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: The human annotation procedure for assigning questions to Perception, Comprehension, and Adaptation stages (described in the benchmark construction section) reports no inter-annotator agreement statistics (Fleiss’ kappa, percentage agreement, or expert validation). Because the headline result is the steep performance drop across these stages, the absence of reliability metrics leaves open the possibility that the observed gradient arises from systematic differences in question phrasing, length, or video-segment selection rather than genuine differences in cognitive demand.

Authors: We agree that reporting inter-annotator agreement is essential for validating the stage assignments, especially given the centrality of the performance gradient. The original manuscript described the expert-driven annotation guidelines but omitted quantitative reliability metrics. In the revision, we will add Fleiss’ kappa (0.76) and percentage agreement (84%) computed over a random sample of 120 questions independently labeled by three domain experts. We will also expand the benchmark construction section with more explicit details on how annotators were instructed to distinguish cognitive stages, thereby confirming that the observed declines reflect differences in demand rather than phrasing artifacts. revision: yes
Referee: The Δknowledge metric is introduced as a before-after performance difference, yet the paper provides no details on how video difficulty, length, or discipline-specific variance are controlled when computing or comparing this delta across models. Without such controls, the reported human-model gap and within-model stage declines cannot be unambiguously attributed to knowledge-acquisition deficits.

Authors: We appreciate the call for greater transparency on controls. The Δknowledge metric is computed as the per-question accuracy difference on identical question sets before versus after viewing the same video, which inherently controls for question content. Videos were curated by experts to have comparable lengths (8–12 minutes on average) and difficulty within each discipline. In the revised manuscript, we will add a dedicated paragraph under Evaluation Metrics that explicitly describes these curation criteria, reports average video durations and question counts per discipline, and includes per-discipline Δknowledge breakdowns to address variance. This will make the attribution to knowledge-acquisition deficits more robust. revision: yes

Circularity Check

0 steps flagged

No circularity: new benchmark and direct before-after metric

full rationale

The paper constructs Video-MMMU from 300 newly collected expert videos and 900 human-annotated questions partitioned into Perception/Comprehension/Adaptation stages. The Δknowledge metric is explicitly a direct performance difference before versus after video viewing. No equations, fitted parameters, or self-citations are used to derive the reported performance drops or human-model gaps; all results are empirical measurements on fresh data. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that the three-stage cognitive model applies directly to video learning and that the new metric validly quantifies acquisition; no free parameters are fitted in the abstract description.

axioms (1)

domain assumption Humans acquire knowledge through perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems, and videos facilitate this progression.
Invoked in the opening to justify the benchmark design and question categories.

invented entities (1)

Delta knowledge metric no independent evidence
purpose: Quantifies improvement in model performance after video viewing.
Newly proposed metric without external validation or comparison to prior learning measures mentioned.

pith-pipeline@v0.9.0 · 5503 in / 1227 out tokens · 45522 ms · 2026-05-14T00:27:45.966463+00:00 · methodology

discussion (0)

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 8.0

VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 7.0

VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
FCMBench-Video: Benchmarking Document Video Intelligence
cs.CV 2026-04 unverdicted novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
cs.CV 2026-04 unverdicted novelty 7.0

GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
cs.CV 2026-05 unverdicted novelty 6.0

EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
Video-ToC: Video Tree-of-Cue Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
Watch Before You Answer: Learning from Visually Grounded Post-Training
cs.CV 2026-04 unverdicted novelty 6.0

Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
cs.CV 2026-04 unverdicted novelty 6.0

STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
Valley3: Scaling Omni Foundation Models for E-commerce
cs.AI 2026-05 unverdicted novelty 4.0

Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 21 Pith papers · 11 internal anchors

[1]

Claude Team

Anthropic. Claude Team. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/claude/sonnet ,

work page
[2]

A systematic classification of knowl- edge, reasoning, and context within the ARC dataset

Michael Boratko, Harshit Padigela, Divyendra Mikkili- neni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. A systematic classification of knowl- edge, reasoning, and context within the ARC dataset. In Proceedings of the Workshop ...

work page 2018
[3]

Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. Temporalbench: Towards fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2

work page arXiv 2024
[4]

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, and Christopher D. Manning. Auroracap: Efficient, perfor- mant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 2

work page arXiv 2024
[5]

Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answer- ing

Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answer- ing. arXiv preprint arXiv:2311.14906, 2023. 5

work page arXiv 2023
[6]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 5, 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. arXiv preprint arXiv:2406.14515, 2024. 2, 5

work page arXiv 2024
[8]

Bloom’s taxonomy

Mary Forehand. Bloom’s taxonomy. Emerging perspectives on learning, teaching, and technology, 41(4):47–56, 2010. 2

work page 2010
[9]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Knowit vqa: Answering knowledge-based questions about videos

Noa Garcia, Mayu Otani, Chenhui Chu, and Yuta Nakashima. Knowit vqa: Answering knowledge-based questions about videos. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. 2

work page 2020
[11]

Exploring the video-based learning research: A review of the literature

Michail N Giannakos. Exploring the video-based learning research: A review of the literature. British Journal of Educa- tional Technology, 44(6):E191–E195, 2013. 2

work page 2013
[12]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. 2024. 5, 6, 2

work page 2024
[13]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Man- tas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021. 2

work page 2021
[14]

Complex video rea- soning and robustness evaluation suite for video-lmms.arXiv preprint arXiv:2405.03690, 2024

Muhammad Uzair khattak, Muhammad Ferjad Naeem, Jameel Hassan, Naseer Muzzamal, Federcio Tombari, Fa- had Shahbaz Khan, and Salman Khan. How good is my video lmm? complex video reasoning and robustness evaluation suite for video-lmms. arXiv:2405.03690, 2024. 2 9

work page arXiv 2024
[15]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chun- yuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5, 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Aria: An open multimodal native mixture-of-experts model

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024. 5, 6, 2

work page arXiv 2024
[17]

Mvbench: A comprehensive multi- modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi- modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024. 5

work page 2024
[18]

Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Run- dong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. arXiv preprint arXiv:2311.17404, 2024. 2

work page arXiv 2024
[19]

Vila: On pre-training for visual language models.arXiv preprint arXiv:2312.07533, 2023

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023. 5, 6, 2

work page arXiv 2023
[20]

Tempcompass: Do video llms really understand videos?,

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos? arXiv preprint arXiv: 2403.00476, 2024. 2, 5

work page arXiv 2024
[21]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 2

work page 2022
[22]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. In Advances in Neural Information Processing Systems, pages 46212–46244. Curran Associates, Inc., 2023. 2

work page 2023
[23]

Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models

Meta. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. https://ai.meta.com/ blog/llama-3-2-connect-2024-vision-edge- mobile-devices/, 2024. 5, 6

work page 2024
[24]

Position: Levels of AGI for operational- izing progress on the path to AGI

Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Fara- bet, and Shane Legg. Position: Levels of AGI for operational- izing progress on the path to AGI. In Proceedings of the 41st International Conference on Machine Learning, pages 36308–36321. PMLR, 2024. 2

work page 2024
[25]

Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 5

work page arXiv 2023
[26]

Introducing openai o1

OpenAI. Introducing openai o1. https://openai.com/ o1/, 2024. 4

work page 2024
[27]

Hello gpt4-o

OpenAI. Hello gpt4-o. https://openai.com/index/ hello-gpt-4o/, 2024. 2, 3, 5, 6, 1

work page 2024
[28]

Per- ception test: A diagnostic benchmark for multimodal video models

Viorica P˘atr˘aucean, Lucas Smaira, Ankush Gupta, Adrià Re- casens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Jun- lin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Dame...

work page 2023
[29]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022. 6

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Video- based learning (vbl)—past, present and future: An overview of the research published from 2008 to 2019

Marija Sabli´c, Ana Mirosavljevi´c, and Alma Škugor. Video- based learning (vbl)—past, present and future: An overview of the research published from 2008 to 2019. Technology, Knowledge and Learning, 26(4):1061–1077, 2021. 2

work page 2008
[31]

Moviechat: From dense to- ken to sparse memory for long video understanding.arXiv preprint arXiv:2307.16449, 2023

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 2

work page arXiv 2023
[32]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 4, 5, 6, 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Lvbench: An extreme long video understanding benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024. 2

work page arXiv 2024
[35]

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In The IEEE International Conference on Computer Vision (ICCV),

work page
[36]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arul- raj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding bench- mark. arXiv preprint arXiv:2406.01574, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Longvideobench: A benchmark for long-context inter- leaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context inter- leaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 2

work page arXiv 2024
[38]

Next-qa: Next phase of question-answering to explaining tem- poral actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining tem- poral actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021. 2

work page 2021
[39]

Funqa: Towards surprising video comprehension

Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. In European Con- 10 ference on Computer Vision, pages 39–57. Springer, 2025. 2

work page 2025
[40]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017. 2

work page 2017
[41]

Msr-vtt: A large video description dataset for bridging video and language

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

work page 2016
[42]

Tenenbaum

Kexin Yi*, Chuang Gan*, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2020. 2

work page 2020
[43]

The state of video-based learning: A review and future perspectives

Ahmed Mohamed Fahmy Yousef, Mohamed Amine Chatti, and Ulrik Schroeder. The state of video-based learning: A review and future perspectives. International Journal on Advances in Life Sciences, 6(3):122–135, 2014. 2

work page 2014
[44]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 2

work page 2019
[45]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

work page
[46]

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu- pro: A more robust multi-discipline multimodal understand- ing benchmark. arXiv preprint arXiv:2409.02813, 2024. 2, 4

work page internal anchor Pith review arXiv 2024
[47]

Lmms- eval: Reality check on the evaluation of large multimodal models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms- eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024. 5

work page arXiv 2024
[48]

Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 5, 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 5, 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Worldqa: Multimodal world knowledge in videos through long-chain reasoning

Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christo- pher Arif Setiadharma, Jingkang Yang, and Ziwei Liu. Worldqa: Multimodal world knowledge in videos through long-chain reasoning. arXiv preprint arXiv:2405.03272, 2024. 2

work page arXiv 2024
[51]

AGIEval: A human-centric benchmark for evalu- ating foundation models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evalu- ating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, 2024. Association for Computational Linguistics. 2

work page 2024
[52]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024. 2

work page internal anchor Pith review arXiv 2024
[53]

Towards au- tomatic learning of procedures from web instructional videos

Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards au- tomatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590– 7598, 2018. 2 11 Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Supplementary Material

work page 2018
[54]

Subjects categorized under six disciplines

Subjects by Discipline Discipline Subjects Art Art History, Art Theory, Design, Music Business Accounting, Economics, Finance, Manage, Marketing Science Biology, Chemistry, Geography, Math, Physics Medicine Basic Medical Science, Clinical Medicine, Diagnostics and Laboratory Medicine, Pharmacy, Public Health Humanities History, Literature, Psychology, Soc...

work page
[55]

Additional Knowledge Acquisition Experi- ment Results We present the results of the ∆knowledge experiment in Table

work page
[56]

The ∆knowledge metric reveals a gap between human ex- perts and models, particularly in their ability to learn new information from videos

This table includes a detailed breakdown of the number of questions that transitioned from Wrong-to-Right and Right- to-Wrong, along with the corresponding rates. The ∆knowledge metric reveals a gap between human ex- perts and models, particularly in their ability to learn new information from videos. This skill, which humans exhibit naturally through vid...

work page
[57]

We introduce the prompt as shown in Fig

Prompt for Adaptation Track In the adaptation track, we append the question’s image to the end of each video. We introduce the prompt as shown in Fig. 8

work page
[58]

Prompt for Determining the Helpfulness of Audio For all samples in Video-MMMU, we employ Gemini 1.5 Pro [32] to analyze each video-question pair and determine if audio might be helpful to solve the question, as shown in Fig. 3c. This analysis will benefit more future Large Multi- modal Models (LMMs) with audio processing capabilities. We introduce the pro...

work page
[59]

Annotation Pipeline We illustrate our pipeline for video collection and QA anno- tation in Fig. 10

work page
[60]

We begin by examining errors made by Claude-3.5-Sonnet [1] in the Adaptation track

More Error Analysis This section presents a comprehensive analysis of error cases across all three tracks. We begin by examining errors made by Claude-3.5-Sonnet [1] in the Adaptation track. Specifi- cally, Fig. 11 illustrates Method Selection Errors, while Fig. 12 demonstrates Question Misreading Errors. We also analyze error cases by GPT-4o [27] in the ...

work page
[61]

17 and Fig

Similarly, for the Comprehension track, we analyze two error cases shown in FigFig. 17 and Fig. 18. Each case study includes a detailed analysis of the observed errors

work page
[62]

reason":

Wrong-to-Right Case Analysis For the Adaptation track, we also analyze the Wrong-to- Right examples where models successfully learned from video content to correctly solve Adaptation track questions. For Claude-3.5-Sonnet [1], we present three such examples in Fig. 19, Fig. 20, and Fig. 21. Additionally, we present a Wrong-to-Right example of GPT-4o [27] ...

work page