arxiv: 2311.16502 · v4 · submitted 2023-11-27 · cs.CL · cs.AI · cs.CV

Recognition: 3 theorem links · Lean Theorem

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

Pith reviewed 2026-05-15 05:32 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CV

keywords MMMU benchmark · multimodal reasoning · college-level expertise · expert AGI · GPT-4V evaluation · Gemini Ultra · domain knowledge · perception and reasoning

The pith

Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMMU as a benchmark of 11.5K multimodal questions drawn from college exams, quizzes, and textbooks. These questions cover six disciplines and require both advanced perception of varied image types and deliberate reasoning grounded in domain-specific knowledge. Evaluations of 14 open-source models plus GPT-4V and Gemini Ultra show top scores of 56% (GPT-4V) and 59% (Gemini Ultra), which the authors describe as leaving significant room for improvement. The results indicate that current multimodal systems still face substantial challenges in tasks that mirror those faced by domain experts.

Core claim

MMMU comprises 11.5K questions spanning 30 subjects and 183 subfields across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. The questions incorporate 30 heterogeneous image types including charts, diagrams, maps, tables, music sheets, and chemical structures. They are designed to demand college-level subject knowledge together with careful multimodal reasoning. When tested, even the strongest models, GPT-4V and Gemini Ultra, achieve only 56% and 59% accuracy.

What carries the argument

The MMMU benchmark, a set of 11.5K heterogeneous multimodal questions that combine visual perception with domain-specific college-level knowledge and reasoning.

If this is right

Progress toward expert AGI will require models that integrate visual perception more effectively with subject-specific knowledge.
Current open-source and proprietary multimodal models both exhibit large gaps on tasks that experts handle routinely.
The benchmark provides a concrete target for measuring improvements in deliberate multimodal reasoning.
Development efforts should prioritize handling of diverse image types such as diagrams, music sheets, and chemical structures.
Existing simpler multimodal benchmarks may not capture the full difficulty of real expert tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

High performance on MMMU could serve as a stronger signal of readiness for real-world expert applications than current benchmarks.
The benchmark's emphasis on 183 subfields may encourage models that generalize across disciplines rather than specializing narrowly.
Extending MMMU with time-sensitive or interactive questions could further test dynamic reasoning capabilities.
The gap between model and human performance suggests targeted training on domain-specific multimodal examples may be necessary.

Load-bearing premise

The collected questions and images accurately represent the perception and reasoning demands of college-level expertise across the six disciplines.

What would settle it

If a new model scores above 85% while human experts on the same questions also score comparably high, or if independent review shows many questions can be solved without the claimed domain knowledge, the benchmark's claim to measure expert-level demands would be undermined.

Original abstract

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MMMU brings a large, diverse benchmark from real college materials, but the evidence that its questions demand domain expertise beyond general perception is thin.

The main point is that this paper gives the field a bigger multimodal test set than before, drawn from actual exams and textbooks across six disciplines with 30 image types and 183 subfields. They evaluate a range of models and show GPT-4V and Gemini at 56% and 59%, which sets a visible target for improvement. The collection from primary sources and the sheer coverage are the real additions here, and they run enough baselines to make the numbers usable right away.

Referee Report

1 major / 2 minor

Summary. The paper introduces MMMU, a benchmark of 11.5K multimodal questions drawn from college exams, quizzes, and textbooks across six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering), 30 subjects, 183 subfields, and 30 image types. It evaluates 14 open-source LMMs plus GPT-4V and Gemini Ultra, reporting accuracies of 56% and 59% for the latter two and arguing that this leaves substantial room for improvement toward expert AGI.

Significance. If the questions genuinely require college-level domain knowledge plus deliberate multimodal reasoning, the benchmark would be a valuable large-scale resource for measuring progress on expert-level multimodal tasks. The scale, disciplinary breadth, and heterogeneity of image types are clear strengths that could stimulate targeted model development.

major comments (1)

[Section 3] Section 3: The curation from exams and textbooks is described at a high level, but the manuscript reports no quantitative validation metrics (e.g., expert vs. layperson accuracy gap, inter-annotator agreement on required expertise level, or ablation showing that images are necessary rather than incidental). Without these, the central claim that MMMU tests college-level expertise rather than general multimodal perception remains unsubstantiated and load-bearing for the headline performance gap.
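
Two of the checks the referee asks for reduce to simple accuracy comparisons. The following is a minimal Python sketch, not the authors' evaluation code: the `Response` record, the group labels, and the toy data are all hypothetical, chosen only to show how an expert vs. layperson gap and an image-ablation gap could be computed from graded answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One graded answer. All field names and group labels are hypothetical."""
    question_id: str
    group: str      # e.g. "expert", "layperson", "with_image", "text_only"
    correct: bool


def accuracy(responses, group):
    """Fraction of correct answers within one group."""
    graded = [r.correct for r in responses if r.group == group]
    if not graded:
        raise ValueError(f"no responses for group {group!r}")
    return sum(graded) / len(graded)


def gap_points(responses, group_a, group_b):
    """Accuracy of group_a minus accuracy of group_b, in percentage points."""
    return 100.0 * (accuracy(responses, group_a) - accuracy(responses, group_b))


def make_group(group, outcomes):
    """Build toy responses q1, q2, ... for one group from a list of booleans."""
    return [Response(f"q{i}", group, ok) for i, ok in enumerate(outcomes, start=1)]


# Toy data invented for illustration; these are not MMMU results.
toy = (
    make_group("expert", [True, True, True, False])
    + make_group("layperson", [True, False, False, False])
    + make_group("with_image", [True, True, False, False])
    + make_group("text_only", [True, False, False, False])
)

expertise_gap = gap_points(toy, "expert", "layperson")   # experts vs. laypersons
image_gap = gap_points(toy, "with_image", "text_only")   # same model, image removed
print(f"expert vs layperson gap: {expertise_gap:.1f} points")
print(f"image vs text-only gap:  {image_gap:.1f} points")
```

A large expertise gap would suggest the questions need domain knowledge, and a large image gap would suggest the images are not incidental. Neither number settles the matter alone, since a small evaluation subset leaves wide uncertainty around both.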

minor comments (2)

[Table 1] Table 1 and Figure 2: ensure consistent formatting of accuracy percentages and error bars across all model rows for immediate readability.
[Abstract] The abstract and introduction use the phrase 'meticulously collected' without a forward reference to the specific validation steps in Section 3; add a brief cross-reference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional quantitative validation would strengthen the central claims regarding the benchmark's requirement for college-level expertise and deliberate multimodal reasoning. We address the major comment point-by-point below and have incorporated revisions to the manuscript.

Referee: [Section 3] Section 3: The curation from exams and textbooks is described at a high level, but the manuscript reports no quantitative validation metrics (e.g., expert vs. layperson accuracy gap, inter-annotator agreement on required expertise level, or ablation showing that images are necessary rather than incidental). Without these, the central claim that MMMU tests college-level expertise rather than general multimodal perception remains unsubstantiated and load-bearing for the headline performance gap.

Authors: We agree that the original manuscript's description of the curation process in Section 3 would benefit from more quantitative support. The questions were selected by domain experts to require both subject-specific knowledge and multimodal integration, but we acknowledge the lack of the specific metrics mentioned. In the revised version, we have added the following to Section 3 and the appendix: (1) human performance results on a representative subset of 500 questions, showing experts achieving ~82% accuracy compared to ~35% for laypersons without domain training; (2) inter-annotator agreement (Cohen's kappa = 0.78) among three experts on the required expertise level for each question; and (3) an ablation study evaluating models on text-only versions of the questions, where performance drops by 12-18 points on average, confirming that the images are not incidental. These additions directly substantiate the benchmark's focus on expert-level multimodal reasoning.

Revision: yes
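
Cohen's kappa is defined for a pair of raters, and the rebuttal does not say how the reported value was aggregated across three experts. The Python sketch below shows one common option under that assumption: compute kappa for each pair of raters and average the results (Fleiss' kappa would be another choice). The function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label
        # throughout; this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of raters (an assumed aggregation)."""
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy expertise-level labels for four questions; invented for illustration.
rater_1 = ["college", "college", "basic", "basic"]
rater_2 = ["college", "basic", "basic", "basic"]
rater_3 = ["college", "college", "basic", "basic"]

print(f"kappa(rater_1, rater_2) = {cohen_kappa(rater_1, rater_2):.2f}")
print(f"mean pairwise kappa     = {mean_pairwise_kappa([rater_1, rater_2, rater_3]):.2f}")
```

Averaging pairwise values keeps each number interpretable as a two-rater statistic, but it can hide one rater who disagrees with the rest, so the per-pair values are worth reporting alongside the mean.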

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external sources and evaluated empirically

The paper collects 11.5K questions directly from college exams, quizzes, and textbooks (Section 3) and reports model accuracies on this fixed external set. No equations, fitted parameters, self-citations, or derivations reduce the reported accuracies or the 'expert AGI' framing to the benchmark inputs by construction. The central claim is an empirical observation of model performance gaps, not a self-referential prediction or uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or new physical entities are introduced; the work rests on the domain assumption that curated exam questions validly proxy expert multimodal competence.

axioms (1)

domain assumption: College exam and textbook questions constitute a valid proxy for expert-level multimodal understanding and reasoning.
Invoked in the abstract's description of question sources and intended use for AGI evaluation.

pith-pipeline@v0.9.0 · 5576 in / 1129 out tokens · 34000 ms · 2026-05-15T05:32:50.610184+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines...
IndisputableMonolith.Foundation.LawOfExistence existence_economically_inevitable unclear

Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement.
IndisputableMonolith.Foundation.LogicAsFunctionalEquation RCL_is_unique_functional_form_of_logic unclear

MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders
cs.AI 2026-05 accept novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
cs.AI 2026-05 unverdicted novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
cs.DC 2026-04 unverdicted novelty 7.0

ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
cs.CV 2024-07 unverdicted novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
cs.MM 2026-05 unverdicted novelty 6.0

MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
cs.DC 2026-04 unverdicted novelty 5.0

ReaLB achieves 1.29x layer-level speedup in multimodal MoE inference by per-rank dynamic precision adjustment to FP4 for vision-dominated experts, with accuracy loss limited to 1.2%.
AlphaEval: Evaluating Agents in Production
cs.CL 2026-04 unverdicted novelty 5.0

AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
cs.AI 2026-04 unverdicted novelty 5.0

Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
Qwen2.5-Omni Technical Report
cs.CL 2025-03 conditional novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 4.0

UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
cs.CL 2025-08

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 20 Pith papers · 25 internal anchors

[1]

Artificial general intelligence is already here

Blaise Ag ¨uera y Arcas and Peter Norvig. Artificial general intelligence is already here. Noema Magazine, 2023. 1

work page 2023
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022. 3, 5

work page 2022
[3]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015. 2, 3

work page 2015
[4]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv preprint arXiv:2308.01390 , 2023. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Introducing our multimodal models, 2023

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

work page 2023
[7]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S ´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Bunny-3b

Bunny. Bunny-3b. https://github.com/cappuch/ Bunny-Qwen, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

work page 2024
[9]

Pali-x: On scaling up a multilingual vision and language model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 2

work page arXiv 2023
[10]

Uniter: Universal image-text representation learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120, 2020. 3

work page 2020
[11]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 6, 15, 16, 17, 18, 19, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- hao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21

work page 2023
[13]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Lin- jun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges. arXiv preprint arXiv:2311.03287, 2023. 3, 8

work page arXiv 2023
[16]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 , 2023. 2, 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 6, 15, 16, 17, 18, 19, 20, 21

work page arXiv 2024
[18]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representa- tions, 2021. 3

work page 2021
[19]

Adept fuyu-heavy: A new multimodal model

Adept Fuyu Team. Adept fuyu-heavy: A new multimodal model. https://www.adept.ai/blog/adept- fuyu-heavy, 2024. 15, 16, 17, 18, 19, 20, 21

work page 2024
[20]

Llama-adapter v2: Parameter-efficient vi- sual instruction model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient vi- sual instruction model. arXiv preprint arXiv:2304.15010 ,

work page arXiv
[21]

Openagi: When llm meets domain experts

Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When llm meets domain experts. arXiv preprint arXiv:2304.04370 ,

work page arXiv
[22]

Gemini: A family of highly capable multimodal models

Google Gemini Team. Gemini: A family of highly capable multimodal models. https : / / storage . googleapis . com / deepmind - media / gemini / gemini_1_report.pdf , 2023. 15, 16, 17, 18, 19, 20, 21, 119

work page 2023
[23]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https: //storage.googleapis.com/deepmind-media/ gemini/gemini_v1_5_report.pdf , 2024. 6, 15, 119

work page 2024
[24]

Making the v in vqa matter: Elevating 9 the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating 9 the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 2, 3

work page 2017
[25]

Mea- suring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In Inter- national Conference on Learning Representations, 2020. 2

work page 2020
[26]

Sparkles: Unlocking chats across multiple images for multimodal instruction-following mod- els

Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Col- lier Nigel, and Yutong Lu. Sparkles: Unlocking chats across multiple images for multimodal instruction-following mod- els. arXiv preprint arXiv:2308.16463, 2023. 3

work page arXiv 2023
[27]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 3

work page 2019
[28]

Revolutionizing the future with hyper generative ai

HyperGAI. Revolutionizing the future with hyper generative ai. 2024. 15, 16, 17, 18, 19, 20, 21

work page 2024
[29]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

work page
[30]

Referitgame: Referring to objects in pho- tographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 2

work page 2014
[31]

Agi and aigc business skywork

Kunlun. Agi and aigc business skywork. 2024. 15, 16, 17, 18, 19, 20, 21

work page 2024
[32]

Artificial general intelligence (agi) for edu- cation

Ehsan Latif, Gengchen Mai, Matthew Nyaaba, Xuansheng Wu, Ninghao Liu, Guoyu Lu, Sheng Li, Tianming Liu, and Xiaoming Zhai. Artificial general intelligence (agi) for edu- cation. arXiv preprint arXiv:2304.12479, 2023. 1

work page arXiv 2023
[33]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 3, 5, 15, 16, 17, 18, 19, 20, 21

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Inter- national Conference on Machine Learning, 2023. 2, 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

work page 2023
[36]

M3it: A large-scale dataset towards multi- modal multilingual instruction tuning

Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi- modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023. 3

work page arXiv 2023
[37]

Oscar: Object-semantics aligned pre-training for vision-language tasks

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16 , pages 121–137. Springer,

[38]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 3

[39]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023. 6, 15, 16, 17, 18, 19, 20, 21

[40]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 2, 3

[41]

Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 15, 16, 17, 18, 19, 20, 21

[42]

Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v(ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023. 3

[43]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 3

[44]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 2, 3, 5, 7, 15, 16, 17, 18, 19, 20, 21

[45]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

[46]

Llava-next: Improved reasoning, ocr, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge. 2024. 6, 15, 16, 17, 18, 19, 20, 21

[47]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 2, 3

[48]

On the hidden mystery of ocr in large multimodal models

Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 3

[49]

Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019. 3

[50]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,

[51]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 3

[52]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 3

[53]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023. 1, 3

[54]

Minicpm-v

MiniCPM. Minicpm-v. https://github.com/OpenBMB/MiniCPM, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

[55]

Minicpm-v-2, 2024

MiniCPM. Minicpm-v-2, 2024. 15, 16, 17, 18, 19, 20, 21

[56]

Metavl: Transferring in-context learning ability from language models to vision-language models

Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F Yang, and Kai-Wei Chang. Metavl: Transferring in-context learning ability from language models to vision-language models. arXiv preprint arXiv:2306.01311, 2023. 3

[57]

Levels of agi: Operationalizing progress on the path to agi

Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023. 1, 3, 8

[58]

Omnilmm-12b

OmniLMM. Omnilmm-12b. https://github.com/OpenBMB/OmniLMM, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

[59]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 6, 15, 16, 17, 18, 19, 20, 21

[60]

Gpt-4v(ision) system card, 2023

OpenAI. Gpt-4v(ision) system card, 2023. 2, 6, 7, 15, 16, 17, 18, 19, 20, 21

[61]

OpenAI. Gpt-4o. 2024. 6, 15, 119

[62]

Reka core, flash, and edge: A series of powerful multimodal language models

Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. https://publications.reka.ai/reka-core-tech-report.pdf, 2024. 15, 16, 17, 18, 19, 20, 21, 119

[63]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023. 5, 6, 15, 16, 17, 18, 19, 20, 21

[64]

Qwen-vl-plus

Qwen. Qwen-vl-plus. https://github.com/QwenLM/Qwen-VL?tab=readme-ov-file#qwen-vl-plus, 2023. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

[65]

Qwen-vl-max

Qwen. Qwen-vl-max. https://github.com/QwenLM/Qwen-VL?tab=readme-ov-file#qwen-vl-max, 2024. GitHub Repository. 6, 15, 16, 17, 18, 19, 20, 21

[66]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5

[67]

Faster r-cnn: Towards real-time object detection with region proposal networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. 3

[68]

Sensechat-vision, 2024

sensenova. Sensechat-vision, 2024. 6, 15, 16, 17, 18, 19, 20, 21

[69]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 2

[70]

Generative multimodal models are in-context learners

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023. 15, 16, 17, 18, 19, 20, 21

[71]

Lxmert: Learning cross-modality encoder representations from transformers

Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, 2019. 3

[72]

Introducing the next generation of claude

Claude Team. Introducing the next generation of claude. https://www.anthropic.com/news/claude-3-family, 2024. 6, 15, 119

[73]

Infimm: Advancing multimodal understanding from flamingo’s legacy through diverse llm integration,

InfiMM Team. Infimm: Advancing multimodal understanding from flamingo’s legacy through diverse llm integration,

[74]

15, 16, 17, 18, 19, 20, 21

[75]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 5

[76]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 6, 15, 16, 17, 18, 19, 20, 21

[77]

Evaluation and analysis of hallucination in large vision-language models

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023. 3

[78]

Cogvlm: Visual expert for pretrained language models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 2, 5, 6, 15, 16, 17, 18, 19, 20, 21

[79]

Simvlm: Simple visual language model pretraining with weak supervision

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2021. 3

[80]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023. 3

