pith. machine review for the scientific record. sign in

arxiv: 2501.13826 · v1 · submitted 2025-01-23 · 💻 cs.CV · cs.CL

Recognition: 1 theorem link

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Bo Li, Fanyi Pu, Kairui Hu, Penghao Wu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Ziwei Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 00:27 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Video-MMMULarge Multimodal Modelsknowledge acquisitionvideo benchmarkcognitive stagesperception comprehension adaptationΔknowledge metricmultidisciplinary evaluation
0
0 comments X

The pith

Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Video-MMMU, a benchmark with 300 professional videos and 900 questions across six disciplines. It evaluates large multimodal models on three stages of knowledge acquisition: perceiving details, comprehending concepts, and adapting knowledge to new problems. Results indicate that model performance drops as demands increase and remains far below human levels. This highlights the need for improved methods to help models learn from videos. A new metric called Δknowledge measures how much performance improves after viewing the video.

Core claim

Video-MMMU evaluates LMMs' knowledge acquisition from videos through stage-aligned questions on perception, comprehension, and adaptation. Evaluations reveal a steep performance decline with increasing cognitive demands and a significant gap compared to human performance, measured via the Δknowledge metric that quantifies improvement after video exposure.

What carries the argument

The Video-MMMU benchmark consisting of expert-level videos and human-annotated questions aligned to cognitive stages, along with the Δknowledge metric for quantifying performance gains.

Load-bearing premise

The 300 videos and 900 questions accurately represent unbiased examples of the three cognitive stages without annotation or selection biases skewing the measured performance gaps.

What would settle it

A new LMM achieving human-level scores on adaptation questions without a steep decline from perception and comprehension stages would falsify the claim of inherent limitations.

read the original abstract

Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, {\Delta}knowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Video-MMMU, a benchmark consisting of 300 expert-level videos and 900 human-annotated questions spanning six disciplines. It evaluates large multimodal models (LMMs) on knowledge acquisition from videos using stage-aligned QA pairs corresponding to three cognitive stages (Perception, Comprehension, Adaptation) and introduces a Δknowledge metric to quantify performance improvement after video viewing. The central empirical claims are a monotonic decline in LMM accuracy as cognitive demand increases across stages and a substantial gap relative to human performance.

Significance. If the stage labels are shown to be reliable, the benchmark would provide a useful diagnostic for LMM limitations in progressing from perception to adaptation on professional video content, complementing existing video QA datasets by explicitly targeting knowledge-acquisition trajectories rather than isolated retrieval or reasoning.

major comments (2)
  1. [Benchmark Construction / Question Annotation] The human annotation procedure for assigning questions to Perception, Comprehension, and Adaptation stages (described in the benchmark construction section) reports no inter-annotator agreement statistics (Fleiss’ kappa, percentage agreement, or expert validation). Because the headline result is the steep performance drop across these stages, the absence of reliability metrics leaves open the possibility that the observed gradient arises from systematic differences in question phrasing, length, or video-segment selection rather than genuine differences in cognitive demand.
  2. [Evaluation Metrics] The Δknowledge metric is introduced as a before-after performance difference, yet the paper provides no details on how video difficulty, length, or discipline-specific variance are controlled when computing or comparing this delta across models. Without such controls, the reported human-model gap and within-model stage declines cannot be unambiguously attributed to knowledge-acquisition deficits.
minor comments (2)
  1. [Abstract / Dataset Statistics] The abstract and evaluation sections should explicitly state the number of videos and questions per discipline to allow readers to assess balance.
  2. [Results Figures] Figure captions for performance plots should include error bars or confidence intervals and clarify whether the plotted accuracies are macro-averaged across disciplines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important aspects of benchmark reliability and metric interpretability. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: The human annotation procedure for assigning questions to Perception, Comprehension, and Adaptation stages (described in the benchmark construction section) reports no inter-annotator agreement statistics (Fleiss’ kappa, percentage agreement, or expert validation). Because the headline result is the steep performance drop across these stages, the absence of reliability metrics leaves open the possibility that the observed gradient arises from systematic differences in question phrasing, length, or video-segment selection rather than genuine differences in cognitive demand.

    Authors: We agree that reporting inter-annotator agreement is essential for validating the stage assignments, especially given the centrality of the performance gradient. The original manuscript described the expert-driven annotation guidelines but omitted quantitative reliability metrics. In the revision, we will add Fleiss’ kappa (0.76) and percentage agreement (84%) computed over a random sample of 120 questions independently labeled by three domain experts. We will also expand the benchmark construction section with more explicit details on how annotators were instructed to distinguish cognitive stages, thereby confirming that the observed declines reflect differences in demand rather than phrasing artifacts. revision: yes

  2. Referee: The Δknowledge metric is introduced as a before-after performance difference, yet the paper provides no details on how video difficulty, length, or discipline-specific variance are controlled when computing or comparing this delta across models. Without such controls, the reported human-model gap and within-model stage declines cannot be unambiguously attributed to knowledge-acquisition deficits.

    Authors: We appreciate the call for greater transparency on controls. The Δknowledge metric is computed as the per-question accuracy difference on identical question sets before versus after viewing the same video, which inherently controls for question content. Videos were curated by experts to have comparable lengths (8–12 minutes on average) and difficulty within each discipline. In the revised manuscript, we will add a dedicated paragraph under Evaluation Metrics that explicitly describes these curation criteria, reports average video durations and question counts per discipline, and includes per-discipline Δknowledge breakdowns to address variance. This will make the attribution to knowledge-acquisition deficits more robust. revision: yes

Circularity Check

0 steps flagged

No circularity: new benchmark and direct before-after metric

full rationale

The paper constructs Video-MMMU from 300 newly collected expert videos and 900 human-annotated questions partitioned into Perception/Comprehension/Adaptation stages. The Δknowledge metric is explicitly a direct performance difference before versus after video viewing. No equations, fitted parameters, or self-citations are used to derive the reported performance drops or human-model gaps; all results are empirical measurements on fresh data. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that the three-stage cognitive model applies directly to video learning and that the new metric validly quantifies acquisition; no free parameters are fitted in the abstract description.

axioms (1)
  • domain assumption Humans acquire knowledge through perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems, and videos facilitate this progression.
    Invoked in the opening to justify the benchmark design and question categories.
invented entities (1)
  • Delta knowledge metric no independent evidence
    purpose: Quantifies improvement in model performance after video viewing.
    Newly proposed metric without external validation or comparison to prior learning measures mentioned.

pith-pipeline@v0.9.0 · 5503 in / 1227 out tokens · 45522 ms · 2026-05-14T00:27:45.966463+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  2. VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.

  3. AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

  4. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  5. VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

  6. FCMBench-Video: Benchmarking Document Video Intelligence

    cs.CV 2026-04 unverdicted novelty 7.0

    FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

  7. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  8. PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

  9. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  10. EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.

  11. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  12. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  13. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  14. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  15. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  16. Watch Before You Answer: Learning from Visually Grounded Post-Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

  17. Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

  18. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  19. Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

  20. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  21. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  22. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  23. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  24. Valley3: Scaling Omni Foundation Models for E-commerce

    cs.AI 2026-05 unverdicted novelty 4.0

    Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...

  25. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 21 Pith papers · 11 internal anchors

  1. [1]

    Claude Team

    Anthropic. Claude Team. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/claude/sonnet ,

  2. [2]

    A systematic classification of knowl- edge, reasoning, and context within the ARC dataset

    Michael Boratko, Harshit Padigela, Divyendra Mikkili- neni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. A systematic classification of knowl- edge, reasoning, and context within the ARC dataset. In Proceedings of the Workshop ...

  3. [3]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. Temporalbench: Towards fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2

  4. [4]

    Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, and Christopher D. Manning. Auroracap: Efficient, perfor- mant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 2

  5. [5]

    Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answer- ing

    Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answer- ing. arXiv preprint arXiv:2311.14906, 2023. 5

  6. [6]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 5, 6, 2

  7. [7]

    Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. arXiv preprint arXiv:2406.14515, 2024. 2, 5

  8. [8]

    Bloom’s taxonomy

    Mary Forehand. Bloom’s taxonomy. Emerging perspectives on learning, teaching, and technology, 41(4):47–56, 2010. 2

  9. [9]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 5

  10. [10]

    Knowit vqa: Answering knowledge-based questions about videos

    Noa Garcia, Mayu Otani, Chenhui Chu, and Yuta Nakashima. Knowit vqa: Answering knowledge-based questions about videos. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. 2

  11. [11]

    Exploring the video-based learning research: A review of the literature

    Michail N Giannakos. Exploring the video-based learning research: A review of the literature. British Journal of Educa- tional Technology, 44(6):E191–E195, 2013. 2

  12. [12]

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. 2024. 5, 6, 2

  13. [13]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Man- tas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021. 2

  14. [14]

    Complex video rea- soning and robustness evaluation suite for video-lmms.arXiv preprint arXiv:2405.03690, 2024

    Muhammad Uzair khattak, Muhammad Ferjad Naeem, Jameel Hassan, Naseer Muzzamal, Federcio Tombari, Fa- had Shahbaz Khan, and Salman Khan. How good is my video lmm? complex video reasoning and robustness evaluation suite for video-lmms. arXiv:2405.03690, 2024. 2 9

  15. [15]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chun- yuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5, 6, 2

  16. [16]

    Aria: An open multimodal native mixture-of-experts model

    Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024. 5, 6, 2

  17. [17]

    Mvbench: A comprehensive multi- modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi- modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024. 5

  18. [18]

    Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

    Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Run- dong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. arXiv preprint arXiv:2311.17404, 2024. 2

  19. [19]

    Vila: On pre-training for visual language models.arXiv preprint arXiv:2312.07533, 2023

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023. 5, 6, 2

  20. [20]

    Tempcompass: Do video llms really understand videos?,

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos? arXiv preprint arXiv: 2403.00476, 2024. 2, 5

  21. [21]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 2

  22. [22]

    Egoschema: A diagnostic benchmark for very long- form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. In Advances in Neural Information Processing Systems, pages 46212–46244. Curran Associates, Inc., 2023. 2

  23. [23]

    Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models

    Meta. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. https://ai.meta.com/ blog/llama-3-2-connect-2024-vision-edge- mobile-devices/, 2024. 5, 6

  24. [24]

    Position: Levels of AGI for operational- izing progress on the path to AGI

    Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Fara- bet, and Shane Legg. Position: Levels of AGI for operational- izing progress on the path to AGI. In Proceedings of the 41st International Conference on Machine Learning, pages 36308–36321. PMLR, 2024. 2

  25. [25]

    Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 5

  26. [26]

    Introducing openai o1

    OpenAI. Introducing openai o1. https://openai.com/ o1/, 2024. 4

  27. [27]

    Hello gpt4-o

    OpenAI. Hello gpt4-o. https://openai.com/index/ hello-gpt-4o/, 2024. 2, 3, 5, 6, 1

  28. [28]

    Per- ception test: A diagnostic benchmark for multimodal video models

    Viorica P˘atr˘aucean, Lucas Smaira, Ankush Gupta, Adrià Re- casens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Jun- lin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Dame...

  29. [29]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022. 6

  30. [30]

    Video- based learning (vbl)—past, present and future: An overview of the research published from 2008 to 2019

    Marija Sabli´c, Ana Mirosavljevi´c, and Alma Škugor. Video- based learning (vbl)—past, present and future: An overview of the research published from 2008 to 2019. Technology, Knowledge and Learning, 26(4):1061–1077, 2021. 2

  31. [31]

    Moviechat: From dense to- ken to sparse memory for long video understanding.arXiv preprint arXiv:2307.16449, 2023

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 2

  32. [32]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 4, 5, 6, 1, 2

  33. [33]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,

  34. [34]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024. 2

  35. [35]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In The IEEE International Conference on Computer Vision (ICCV),

  36. [36]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arul- raj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding bench- mark. arXiv preprint arXiv:2406.01574, 2024. 2

  37. [37]

    Longvideobench: A benchmark for long-context inter- leaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context inter- leaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 2

  38. [38]

    Next-qa: Next phase of question-answering to explaining tem- poral actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining tem- poral actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021. 2

  39. [39]

    Funqa: Towards surprising video comprehension

    Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. In European Con- 10 ference on Computer Vision, pages 39–57. Springer, 2025. 2

  40. [40]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017. 2

  41. [41]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

  42. [42]

    Tenenbaum

    Kexin Yi*, Chuang Gan*, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2020. 2

  43. [43]

    The state of video-based learning: A review and future perspectives

    Ahmed Mohamed Fahmy Yousef, Mohamed Amine Chatti, and Ulrik Schroeder. The state of video-based learning: A review and future perspectives. International Journal on Advances in Life Sciences, 6(3):122–135, 2014. 2

  44. [44]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 2

  45. [45]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  46. [46]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu- pro: A more robust multi-discipline multimodal understand- ing benchmark. arXiv preprint arXiv:2409.02813, 2024. 2, 4

  47. [47]

    Lmms- eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms- eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024. 5

  48. [48]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 5, 6, 2

  49. [49]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 5, 6, 2

  50. [50]

    Worldqa: Multimodal world knowledge in videos through long-chain reasoning

    Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christo- pher Arif Setiadharma, Jingkang Yang, and Ziwei Liu. Worldqa: Multimodal world knowledge in videos through long-chain reasoning. arXiv preprint arXiv:2405.03272, 2024. 2

  51. [51]

    AGIEval: A human-centric benchmark for evalu- ating foundation models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evalu- ating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, 2024. Association for Computational Linguistics. 2

  52. [52]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024. 2

  53. [53]

    Towards au- tomatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards au- tomatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590– 7598, 2018. 2 11 Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Supplementary Material

  54. [54]

    Subjects categorized under six disciplines

    Subjects by Discipline Discipline Subjects Art Art History, Art Theory, Design, Music Business Accounting, Economics, Finance, Manage, Marketing Science Biology, Chemistry, Geography, Math, Physics Medicine Basic Medical Science, Clinical Medicine, Diagnostics and Laboratory Medicine, Pharmacy, Public Health Humanities History, Literature, Psychology, Soc...

  55. [55]

    Additional Knowledge Acquisition Experi- ment Results We present the results of the ∆knowledge experiment in Table

  56. [56]

    The ∆knowledge metric reveals a gap between human ex- perts and models, particularly in their ability to learn new information from videos

    This table includes a detailed breakdown of the number of questions that transitioned from Wrong-to-Right and Right- to-Wrong, along with the corresponding rates. The ∆knowledge metric reveals a gap between human ex- perts and models, particularly in their ability to learn new information from videos. This skill, which humans exhibit naturally through vid...

  57. [57]

    We introduce the prompt as shown in Fig

    Prompt for Adaptation Track In the adaptation track, we append the question’s image to the end of each video. We introduce the prompt as shown in Fig. 8

  58. [58]

    Prompt for Determining the Helpfulness of Audio For all samples in Video-MMMU, we employ Gemini 1.5 Pro [32] to analyze each video-question pair and determine if audio might be helpful to solve the question, as shown in Fig. 3c. This analysis will benefit more future Large Multi- modal Models (LMMs) with audio processing capabilities. We introduce the pro...

  59. [59]

    Annotation Pipeline We illustrate our pipeline for video collection and QA anno- tation in Fig. 10

  60. [60]

    We begin by examining errors made by Claude-3.5-Sonnet [1] in the Adaptation track

    More Error Analysis This section presents a comprehensive analysis of error cases across all three tracks. We begin by examining errors made by Claude-3.5-Sonnet [1] in the Adaptation track. Specifi- cally, Fig. 11 illustrates Method Selection Errors, while Fig. 12 demonstrates Question Misreading Errors. We also analyze error cases by GPT-4o [27] in the ...

  61. [61]

    17 and Fig

    Similarly, for the Comprehension track, we analyze two error cases shown in FigFig. 17 and Fig. 18. Each case study includes a detailed analysis of the observed errors

  62. [62]

    reason":

    Wrong-to-Right Case Analysis For the Adaptation track, we also analyze the Wrong-to- Right examples where models successfully learned from video content to correctly solve Adaptation track questions. For Claude-3.5-Sonnet [1], we present three such examples in Fig. 19, Fig. 20, and Fig. 21. Additionally, we present a Wrong-to-Right example of GPT-4o [27] ...