pith. sign in

hub Mixed citations

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Mixed citation behavior. Most common role is background (53%).

60 Pith papers citing it
Background 53% of classified citations
abstract

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

hub tools

citation-role summary

background 9 dataset 5 baseline 1

citation-polarity summary

clear filters

representative citing papers

ReaLB: Real-Time Load Balancing for Multimodal MoE Inference

cs.DC · 2026-04-21 · unverdicted · novelty 7.0

ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

citing papers explorer

Showing 1 of 1 citing paper after filters.