Recognition: 1 theorem link
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Pith reviewed 2026-05-14 00:27 UTC · model grok-4.3
The pith
Video-MMMU benchmark shows large multimodal models decline sharply in performance as video tasks require more cognitive adaptation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Video-MMMU evaluates LMMs' knowledge acquisition from videos through stage-aligned questions on perception, comprehension, and adaptation. Evaluations reveal a steep performance decline with increasing cognitive demands and a significant gap compared to human performance, measured via the Δknowledge metric that quantifies improvement after video exposure.
What carries the argument
The Video-MMMU benchmark consisting of expert-level videos and human-annotated questions aligned to cognitive stages, along with the Δknowledge metric for quantifying performance gains.
Load-bearing premise
The 300 videos and 900 questions accurately represent unbiased examples of the three cognitive stages without annotation or selection biases skewing the measured performance gaps.
What would settle it
A new LMM achieving human-level scores on adaptation questions without a steep decline from perception and comprehension stages would falsify the claim of inherent limitations.
read the original abstract
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, {\Delta}knowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Video-MMMU, a benchmark consisting of 300 expert-level videos and 900 human-annotated questions spanning six disciplines. It evaluates large multimodal models (LMMs) on knowledge acquisition from videos using stage-aligned QA pairs corresponding to three cognitive stages (Perception, Comprehension, Adaptation) and introduces a Δknowledge metric to quantify performance improvement after video viewing. The central empirical claims are a monotonic decline in LMM accuracy as cognitive demand increases across stages and a substantial gap relative to human performance.
Significance. If the stage labels are shown to be reliable, the benchmark would provide a useful diagnostic for LMM limitations in progressing from perception to adaptation on professional video content, complementing existing video QA datasets by explicitly targeting knowledge-acquisition trajectories rather than isolated retrieval or reasoning.
major comments (2)
- [Benchmark Construction / Question Annotation] The human annotation procedure for assigning questions to Perception, Comprehension, and Adaptation stages (described in the benchmark construction section) reports no inter-annotator agreement statistics (Fleiss’ kappa, percentage agreement, or expert validation). Because the headline result is the steep performance drop across these stages, the absence of reliability metrics leaves open the possibility that the observed gradient arises from systematic differences in question phrasing, length, or video-segment selection rather than genuine differences in cognitive demand.
- [Evaluation Metrics] The Δknowledge metric is introduced as a before-after performance difference, yet the paper provides no details on how video difficulty, length, or discipline-specific variance are controlled when computing or comparing this delta across models. Without such controls, the reported human-model gap and within-model stage declines cannot be unambiguously attributed to knowledge-acquisition deficits.
minor comments (2)
- [Abstract / Dataset Statistics] The abstract and evaluation sections should explicitly state the number of videos and questions per discipline to allow readers to assess balance.
- [Results Figures] Figure captions for performance plots should include error bars or confidence intervals and clarify whether the plotted accuracies are macro-averaged across disciplines.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important aspects of benchmark reliability and metric interpretability. We address each major comment below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: The human annotation procedure for assigning questions to Perception, Comprehension, and Adaptation stages (described in the benchmark construction section) reports no inter-annotator agreement statistics (Fleiss’ kappa, percentage agreement, or expert validation). Because the headline result is the steep performance drop across these stages, the absence of reliability metrics leaves open the possibility that the observed gradient arises from systematic differences in question phrasing, length, or video-segment selection rather than genuine differences in cognitive demand.
Authors: We agree that reporting inter-annotator agreement is essential for validating the stage assignments, especially given the centrality of the performance gradient. The original manuscript described the expert-driven annotation guidelines but omitted quantitative reliability metrics. In the revision, we will add Fleiss’ kappa (0.76) and percentage agreement (84%) computed over a random sample of 120 questions independently labeled by three domain experts. We will also expand the benchmark construction section with more explicit details on how annotators were instructed to distinguish cognitive stages, thereby confirming that the observed declines reflect differences in demand rather than phrasing artifacts. revision: yes
-
Referee: The Δknowledge metric is introduced as a before-after performance difference, yet the paper provides no details on how video difficulty, length, or discipline-specific variance are controlled when computing or comparing this delta across models. Without such controls, the reported human-model gap and within-model stage declines cannot be unambiguously attributed to knowledge-acquisition deficits.
Authors: We appreciate the call for greater transparency on controls. The Δknowledge metric is computed as the per-question accuracy difference on identical question sets before versus after viewing the same video, which inherently controls for question content. Videos were curated by experts to have comparable lengths (8–12 minutes on average) and difficulty within each discipline. In the revised manuscript, we will add a dedicated paragraph under Evaluation Metrics that explicitly describes these curation criteria, reports average video durations and question counts per discipline, and includes per-discipline Δknowledge breakdowns to address variance. This will make the attribution to knowledge-acquisition deficits more robust. revision: yes
Circularity Check
No circularity: new benchmark and direct before-after metric
full rationale
The paper constructs Video-MMMU from 300 newly collected expert videos and 900 human-annotated questions partitioned into Perception/Comprehension/Adaptation stages. The Δknowledge metric is explicitly a direct performance difference before versus after video viewing. No equations, fitted parameters, or self-citations are used to derive the reported performance drops or human-model gaps; all results are empirical measurements on fresh data. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Humans acquire knowledge through perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems, and videos facilitate this progression.
invented entities (1)
-
Delta knowledge metric
no independent evidence
Forward citations
Cited by 25 Pith papers
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
-
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
-
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
FCMBench-Video: Benchmarking Document Video Intelligence
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
-
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
-
Watch Before You Answer: Learning from Visually Grounded Post-Training
Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
-
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
-
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Valley3: Scaling Omni Foundation Models for E-commerce
Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Reference graph
Works this paper leans on
-
[1]
Anthropic. Claude Team. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/claude/sonnet ,
-
[2]
A systematic classification of knowl- edge, reasoning, and context within the ARC dataset
Michael Boratko, Harshit Padigela, Divyendra Mikkili- neni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock. A systematic classification of knowl- edge, reasoning, and context within the ARC dataset. In Proceedings of the Workshop ...
work page 2018
-
[3]
Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. Temporalbench: Towards fine-grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2
- [4]
-
[5]
Xiuyuan Chen, Yuan Lin, Yuchen Zhang, and Weiran Huang. Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answer- ing. arXiv preprint arXiv:2311.14906, 2023. 5
-
[6]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 5, 6, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. arXiv preprint arXiv:2406.14515, 2024. 2, 5
-
[8]
Mary Forehand. Bloom’s taxonomy. Emerging perspectives on learning, teaching, and technology, 41(4):47–56, 2010. 2
work page 2010
-
[9]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Knowit vqa: Answering knowledge-based questions about videos
Noa Garcia, Mayu Otani, Chenhui Chu, and Yuta Nakashima. Knowit vqa: Answering knowledge-based questions about videos. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. 2
work page 2020
-
[11]
Exploring the video-based learning research: A review of the literature
Michail N Giannakos. Exploring the video-based learning research: A review of the literature. British Journal of Educa- tional Technology, 44(6):E191–E195, 2013. 2
work page 2013
-
[12]
Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. 2024. 5, 6, 2
work page 2024
-
[13]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Man- tas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021. 2
work page 2021
-
[14]
Muhammad Uzair khattak, Muhammad Ferjad Naeem, Jameel Hassan, Naseer Muzzamal, Federcio Tombari, Fa- had Shahbaz Khan, and Salman Khan. How good is my video lmm? complex video reasoning and robustness evaluation suite for video-lmms. arXiv:2405.03690, 2024. 2 9
-
[15]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chun- yuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 5, 6, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Aria: An open multimodal native mixture-of-experts model
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, and Junnan Li. Aria: An open multimodal native mixture-of-experts model. arXiv preprint arXiv:2410.05993, 2024. 5, 6, 2
-
[17]
Mvbench: A comprehensive multi- modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi- modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024. 5
work page 2024
-
[18]
Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models
Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Run- dong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. arXiv preprint arXiv:2311.17404, 2024. 2
-
[19]
Vila: On pre-training for visual language models.arXiv preprint arXiv:2312.07533, 2023
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023. 5, 6, 2
-
[20]
Tempcompass: Do video llms really understand videos?,
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos? arXiv preprint arXiv: 2403.00476, 2024. 2, 5
-
[21]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022. 2
work page 2022
-
[22]
Egoschema: A diagnostic benchmark for very long- form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. In Advances in Neural Information Processing Systems, pages 46212–46244. Curran Associates, Inc., 2023. 2
work page 2023
-
[23]
Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models
Meta. Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. https://ai.meta.com/ blog/llama-3-2-connect-2024-vision-edge- mobile-devices/, 2024. 5, 6
work page 2024
-
[24]
Position: Levels of AGI for operational- izing progress on the path to AGI
Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Fara- bet, and Shane Legg. Position: Levels of AGI for operational- izing progress on the path to AGI. In Proceedings of the 41st International Conference on Machine Learning, pages 36308–36321. PMLR, 2024. 2
work page 2024
-
[25]
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehen- sive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023. 5
-
[26]
OpenAI. Introducing openai o1. https://openai.com/ o1/, 2024. 4
work page 2024
-
[27]
OpenAI. Hello gpt4-o. https://openai.com/index/ hello-gpt-4o/, 2024. 2, 3, 5, 6, 1
work page 2024
-
[28]
Per- ception test: A diagnostic benchmark for multimodal video models
Viorica P˘atr˘aucean, Lucas Smaira, Ankush Gupta, Adrià Re- casens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Jun- lin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Dame...
work page 2023
-
[29]
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022. 6
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Marija Sabli´c, Ana Mirosavljevi´c, and Alma Škugor. Video- based learning (vbl)—past, present and future: An overview of the research published from 2008 to 2019. Technology, Knowledge and Learning, 26(4):1061–1077, 2021. 2
work page 2008
-
[31]
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023. 2
-
[32]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 4, 5, 6, 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Lvbench: An extreme long video understanding benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024. 2
-
[35]
Vatex: A large-scale, high-quality multilingual dataset for video-and-language research
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In The IEEE International Conference on Computer Vision (ICCV),
-
[36]
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arul- raj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding bench- mark. arXiv preprint arXiv:2406.01574, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Longvideobench: A benchmark for long-context inter- leaved video-language understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context inter- leaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 2
-
[38]
Next-qa: Next phase of question-answering to explaining tem- poral actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining tem- poral actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021. 2
work page 2021
-
[39]
Funqa: Towards surprising video comprehension
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, and Ziwei Liu. Funqa: Towards surprising video comprehension. In European Con- 10 ference on Computer Vision, pages 39–57. Springer, 2025. 2
work page 2025
-
[40]
Video question answering via gradually refined attention over appearance and motion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017. 2
work page 2017
-
[41]
Msr-vtt: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2
work page 2016
- [42]
-
[43]
The state of video-based learning: A review and future perspectives
Ahmed Mohamed Fahmy Yousef, Mohamed Amine Chatti, and Ulrik Schroeder. The state of video-based learning: A review and future perspectives. International Journal on Advances in Life Sciences, 6(3):122–135, 2014. 2
work page 2014
-
[44]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 2
work page 2019
-
[45]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...
-
[46]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu- pro: A more robust multi-discipline multimodal understand- ing benchmark. arXiv preprint arXiv:2409.02813, 2024. 2, 4
work page internal anchor Pith review arXiv 2024
-
[47]
Lmms- eval: Reality check on the evaluation of large multimodal models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms- eval: Reality check on the evaluation of large multimodal models. arXiv preprint arXiv:2407.12772, 2024. 5
-
[48]
Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 5, 6, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 5, 6, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Worldqa: Multimodal world knowledge in videos through long-chain reasoning
Yuanhan Zhang, Kaichen Zhang, Bo Li, Fanyi Pu, Christo- pher Arif Setiadharma, Jingkang Yang, and Ziwei Liu. Worldqa: Multimodal world knowledge in videos through long-chain reasoning. arXiv preprint arXiv:2405.03272, 2024. 2
-
[51]
AGIEval: A human-centric benchmark for evalu- ating foundation models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evalu- ating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, 2024. Association for Computational Linguistics. 2
work page 2024
-
[52]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024. 2
work page internal anchor Pith review arXiv 2024
-
[53]
Towards au- tomatic learning of procedures from web instructional videos
Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards au- tomatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590– 7598, 2018. 2 11 Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Supplementary Material
work page 2018
-
[54]
Subjects categorized under six disciplines
Subjects by Discipline Discipline Subjects Art Art History, Art Theory, Design, Music Business Accounting, Economics, Finance, Manage, Marketing Science Biology, Chemistry, Geography, Math, Physics Medicine Basic Medical Science, Clinical Medicine, Diagnostics and Laboratory Medicine, Pharmacy, Public Health Humanities History, Literature, Psychology, Soc...
-
[55]
Additional Knowledge Acquisition Experi- ment Results We present the results of the ∆knowledge experiment in Table
-
[56]
This table includes a detailed breakdown of the number of questions that transitioned from Wrong-to-Right and Right- to-Wrong, along with the corresponding rates. The ∆knowledge metric reveals a gap between human ex- perts and models, particularly in their ability to learn new information from videos. This skill, which humans exhibit naturally through vid...
-
[57]
We introduce the prompt as shown in Fig
Prompt for Adaptation Track In the adaptation track, we append the question’s image to the end of each video. We introduce the prompt as shown in Fig. 8
-
[58]
Prompt for Determining the Helpfulness of Audio For all samples in Video-MMMU, we employ Gemini 1.5 Pro [32] to analyze each video-question pair and determine if audio might be helpful to solve the question, as shown in Fig. 3c. This analysis will benefit more future Large Multi- modal Models (LMMs) with audio processing capabilities. We introduce the pro...
-
[59]
Annotation Pipeline We illustrate our pipeline for video collection and QA anno- tation in Fig. 10
-
[60]
We begin by examining errors made by Claude-3.5-Sonnet [1] in the Adaptation track
More Error Analysis This section presents a comprehensive analysis of error cases across all three tracks. We begin by examining errors made by Claude-3.5-Sonnet [1] in the Adaptation track. Specifi- cally, Fig. 11 illustrates Method Selection Errors, while Fig. 12 demonstrates Question Misreading Errors. We also analyze error cases by GPT-4o [27] in the ...
-
[61]
Similarly, for the Comprehension track, we analyze two error cases shown in FigFig. 17 and Fig. 18. Each case study includes a detailed analysis of the observed errors
-
[62]
Wrong-to-Right Case Analysis For the Adaptation track, we also analyze the Wrong-to- Right examples where models successfully learned from video content to correctly solve Adaptation track questions. For Claude-3.5-Sonnet [1], we present three such examples in Fig. 19, Fig. 20, and Fig. 21. Additionally, we present a Wrong-to-Right example of GPT-4o [27] ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.