AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
hub Mixed citations
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Mixed citation behavior. Most common role is background (53%).
abstract
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Introduces MMBU benchmark for VLMs in biomedicine and demonstrates that established benchmarks mask perception deficiencies in evaluated models.
UltraVR is a new diagnostic benchmark for evidence-grounded VQA on ultra-resolution images, with structured chain-of-thought annotations that localize failures in grounding, perception, and inference.
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.
PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.
FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
V-RoAst applies zero-shot VLMs (Gemini-1.5-flash, GPT-4o-mini) to iRAP road safety attribute classification on a new ThaiRAP image dataset and compares them to CNN baselines, finding better generalization to unseen classes but weaker spatial reasoning.
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
QK Product Steering suppresses dominant singular modes in the per-head QK product of selected middle layers via a closed-form query-only update, yielding 4.0% average relative CHAIR_s reduction on three GQA VLMs.
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
AsyMoE adds hyperbolic geometry for cross-modal hierarchies and evidence-priority experts to address vision-language asymmetry in LVLMs, reporting 1.5% average gains and 25.45% fewer active parameters.
MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
citing papers explorer
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.