pith. machine review for the scientific record.

arxiv: 2311.16502 · v4 · submitted 2023-11-27 · 💻 cs.CL · cs.AI · cs.CV

Recognition: 3 theorem links

· Lean Theorem

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI


Pith reviewed 2026-05-15 05:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords MMMU benchmark · multimodal reasoning · college-level expertise · expert AGI · GPT-4V evaluation · Gemini Ultra · domain knowledge · perception and reasoning

The pith

Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMMU as a benchmark of 11.5K multimodal questions drawn from college exams, quizzes, and textbooks. These questions cover six disciplines and require both advanced perception of varied image types and deliberate reasoning grounded in domain-specific knowledge. Evaluations of 14 open-source models plus GPT-4V and Gemini Ultra show top scores of 56% and 59%, well below what would be expected from expert human performance. The results indicate that current multimodal systems still face substantial challenges in tasks that mirror those faced by domain experts.

Core claim

MMMU comprises 11.5K questions spanning 30 subjects and 183 subfields across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. The questions incorporate 30 heterogeneous image types including charts, diagrams, maps, tables, music sheets, and chemical structures. They are designed to demand college-level subject knowledge together with careful multimodal reasoning. When tested, even the strongest models, GPT-4V and Gemini Ultra, achieve only 56% and 59% accuracy.
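The headline figures are plain accuracies, so it helps to see how such a number could be computed overall and per discipline. The sketch below is illustrative only: the record layout, the exact-match rule on option letters, and the function name `accuracy_by_group` are assumptions made here, not the paper's scoring protocol, and the toy records are invented.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, per_group_accuracy).

    `records` is an iterable of (group, predicted, gold) tuples, where
    `group` could be a discipline name. A prediction counts as correct
    only on exact match with the gold answer (an assumption of this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, gold in records:
        total[group] += 1
        if predicted == gold:
            correct[group] += 1
    if not total:
        raise ValueError("no records to score")
    per_group = {g: correct[g] / total[g] for g in total}
    # Micro-average: every question carries equal weight.
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Toy, made-up predictions; these are not results from the paper.
toy_records = [
    ("Science", "A", "A"),
    ("Science", "B", "C"),
    ("Business", "D", "D"),
    ("Art & Design", "B", "B"),
]
overall, per_group = accuracy_by_group(toy_records)
print(f"overall={overall:.2f}", per_group)
```

A micro-average weights every question equally, so larger disciplines dominate the overall score; averaging the per-discipline values instead would weight each discipline equally and can give a different number.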

What carries the argument

The MMMU benchmark, a set of 11.5K heterogeneous multimodal questions that combine visual perception with domain-specific college-level knowledge and reasoning.

If this is right

  • Progress toward expert AGI will require models that integrate visual perception more effectively with subject-specific knowledge.
  • Current open-source and proprietary multimodal models both exhibit large gaps on tasks that experts handle routinely.
  • The benchmark provides a concrete target for measuring improvements in deliberate multimodal reasoning.
  • Development efforts should prioritize handling of diverse image types such as diagrams, music sheets, and chemical structures.
  • Existing simpler multimodal benchmarks may not capture the full difficulty of real expert tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High performance on MMMU could serve as a stronger signal of readiness for real-world expert applications than current benchmarks.
  • The benchmark's emphasis on 183 subfields may encourage models that generalize across disciplines rather than specializing narrowly.
  • Extending MMMU with time-sensitive or interactive questions could further test dynamic reasoning capabilities.
  • The gap between model and human performance suggests targeted training on domain-specific multimodal examples may be necessary.

Load-bearing premise

The collected questions and images accurately represent the perception and reasoning demands of college-level expertise across the six disciplines.

What would settle it

If a new model scores above 85% while human experts on the same questions also score comparably high, or if independent review shows many questions can be solved without the claimed domain knowledge, the benchmark's claim to measure expert-level demands would be undermined.

read the original abstract

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces MMMU, a benchmark of 11.5K multimodal questions drawn from college exams, quizzes, and textbooks across six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering), 30 subjects, 183 subfields, and 30 image types. It evaluates 14 open-source LMMs plus GPT-4V and Gemini Ultra, reporting accuracies of 56% and 59% for the latter two and arguing that this leaves substantial room for improvement toward expert AGI.

Significance. If the questions genuinely require college-level domain knowledge plus deliberate multimodal reasoning, the benchmark would be a valuable large-scale resource for measuring progress on expert-level multimodal tasks. The scale, disciplinary breadth, and heterogeneity of image types are clear strengths that could stimulate targeted model development.

major comments (1)
  1. [Section 3] Section 3: The curation from exams and textbooks is described at a high level, but the manuscript reports no quantitative validation metrics (e.g., expert vs. layperson accuracy gap, inter-annotator agreement on required expertise level, or ablation showing that images are necessary rather than incidental). Without these, the central claim that MMMU tests college-level expertise rather than general multimodal perception remains unsubstantiated and load-bearing for the headline performance gap.
minor comments (2)
  1. [Table 1] Table 1 and Figure 2: ensure consistent formatting of accuracy percentages and error bars across all model rows for immediate readability.
  2. [Abstract] The abstract and introduction use the phrase 'meticulously collected' without a forward reference to the specific validation steps in Section 3; add a brief cross-reference.
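Two of the checks requested in the major comment, the expert vs. layperson gap and the image-necessity ablation, reduce to the same arithmetic: a difference between two accuracies measured on the same questions. A minimal sketch follows, assuming exact-match scoring; the helper names and the toy answers are hypothetical and not taken from the manuscript.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def accuracy_gap(preds_a, preds_b, gold):
    """Accuracy of A minus accuracy of B, in percentage points."""
    return 100.0 * (accuracy(preds_a, gold) - accuracy(preds_b, gold))


# Toy, made-up answers for four questions; not data from the paper.
gold = ["A", "C", "B", "D"]
with_images = ["A", "C", "B", "A"]  # same model, images shown
text_only = ["A", "B", "D", "A"]    # same model, images withheld

image_gap = accuracy_gap(with_images, text_only, gold)
print(f"drop without images: {image_gap:.1f} points")
```

The same `accuracy_gap` call applies to expert versus layperson answers. A gap near zero in the ablation would suggest the images are incidental; a large gap would support the claim that they are needed.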

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional quantitative validation would strengthen the central claims regarding the benchmark's requirement for college-level expertise and deliberate multimodal reasoning. We address the major comment point-by-point below and have incorporated revisions to the manuscript.

read point-by-point responses
  1. Referee: [Section 3] Section 3: The curation from exams and textbooks is described at a high level, but the manuscript reports no quantitative validation metrics (e.g., expert vs. layperson accuracy gap, inter-annotator agreement on required expertise level, or ablation showing that images are necessary rather than incidental). Without these, the central claim that MMMU tests college-level expertise rather than general multimodal perception remains unsubstantiated and load-bearing for the headline performance gap.

    Authors: We agree that the original manuscript's description of the curation process in Section 3 would benefit from more quantitative support. The questions were selected by domain experts to require both subject-specific knowledge and multimodal integration, but we acknowledge the lack of the specific metrics mentioned. In the revised version, we have added the following to Section 3 and the appendix: (1) human performance results on a representative subset of 500 questions, showing experts achieving ~82% accuracy compared to ~35% for laypersons without domain training; (2) inter-annotator agreement (Cohen's kappa = 0.78) among three experts on the required expertise level for each question; and (3) an ablation study evaluating models on text-only versions of the questions, where performance drops by 12-18 points on average, confirming that the images are not incidental. These additions directly substantiate the benchmark's focus on expert-level multimodal reasoning. revision: yes
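The simulated rebuttal quotes a Cohen's kappa for three experts, but Cohen's kappa is defined for a pair of raters, and the text does not say how three raters were combined. The sketch below shows the pairwise statistic and one possible convention, the mean over all rater pairs; that convention is an assumption made here, and Fleiss' kappa would be a common alternative. The labels are invented.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and aligned")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over all pairs of raters (assumed convention)."""
    pairs = list(combinations(raters, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy, made-up expertise labels for four questions; not data from the paper.
r1 = ["college", "college", "basic", "basic"]
r2 = ["college", "college", "basic", "college"]
r3 = ["college", "college", "basic", "basic"]

print(cohens_kappa(r1, r2), mean_pairwise_kappa([r1, r2, r3]))
```

Kappa corrects raw agreement for the agreement expected by chance, so it can be much lower than the percentage of matching labels when one label dominates.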

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external sources and evaluated empirically

full rationale

The paper collects 11.5K questions directly from college exams, quizzes, and textbooks (Section 3) and reports model accuracies on this fixed external set. No equations, fitted parameters, self-citations, or derivations reduce the reported accuracies or the 'expert AGI' framing to the benchmark inputs by construction. The central claim is an empirical observation of model performance gaps, not a self-referential prediction or uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or new physical entities are introduced; the work rests on the domain assumption that curated exam questions validly proxy expert multimodal competence.

axioms (1)
  • domain assumption: College exam and textbook questions constitute a valid proxy for expert-level multimodal understanding and reasoning.
    Invoked in the abstract's description of question sources and intended use for AGI evaluation.

pith-pipeline@v0.9.0 · 5576 in / 1129 out tokens · 34000 ms · 2026-05-15T05:32:50.610184+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  3. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

  4. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  5. ReaLB: Real-Time Load Balancing for Multimodal MoE Inference

    cs.DC 2026-04 unverdicted novelty 7.0

    ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.

  6. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  7. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  8. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  9. MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    cs.MM 2026-05 unverdicted novelty 6.0

    MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

  10. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  11. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  12. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  16. ReaLB: Real-Time Load Balancing for Multimodal MoE Inference

    cs.DC 2026-04 unverdicted novelty 5.0

    ReaLB achieves 1.29x layer-level speedup in multimodal MoE inference by per-rank dynamic precision adjustment to FP4 for vision-dominated experts, with accuracy loss limited to 1.2%.

  17. AlphaEval: Evaluating Agents in Production

    cs.CL 2026-04 unverdicted novelty 5.0

    AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

  18. Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.

  19. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  20. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  21. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  22. Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

    cs.CL 2025-08

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 20 Pith papers · 25 internal anchors

  1. [1]

    Artificial general intelligence is already here

    Blaise Ag ¨uera y Arcas and Peter Norvig. Artificial general intelligence is already here. Noema Magazine, 2023. 1

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022. 3, 5

  3. [3]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015. 2, 3

  4. [4]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv preprint arXiv:2308.01390 , 2023. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

  6. [6]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

  7. [7]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S ´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 1

  8. [8]

    Bunny-3b

    Bunny. Bunny-3b. https://github.com/cappuch/ Bunny-Qwen, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

  9. [9]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 2

  10. [10]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120, 2020. 3

  11. [11]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 6, 15, 16, 17, 18, 19, 20, 21

  12. [12]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- hao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21

  13. [13]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 1

  14. [14]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21

  15. [15]

    Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Lin- jun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges. arXiv preprint arXiv:2311.03287, 2023. 3, 8

  16. [16]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 , 2023. 2, 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

  17. [17]

    Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 6, 15, 16, 17, 18, 19, 20, 21

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representa- tions, 2021. 3

  19. [19]

    Adept fuyu-heavy: A new multimodal model

    Adept Fuyu Team. Adept fuyu-heavy: A new multimodal model. https://www.adept.ai/blog/adept- fuyu-heavy, 2024. 15, 16, 17, 18, 19, 20, 21

  20. [20]

    Llama-adapter v2: Parameter-efficient vi- sual instruction model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient vi- sual instruction model. arXiv preprint arXiv:2304.15010 ,

  21. [21]

    Openagi: When llm meets domain experts

    Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When llm meets domain experts. arXiv preprint arXiv:2304.04370 ,

  22. [22]

    Gemini: A family of highly capable multimodal models

    Google Gemini Team. Gemini: A family of highly capable multimodal models. https : / / storage . googleapis . com / deepmind - media / gemini / gemini_1_report.pdf , 2023. 15, 16, 17, 18, 19, 20, 21, 119

  23. [23]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https: //storage.googleapis.com/deepmind-media/ gemini/gemini_v1_5_report.pdf , 2024. 6, 15, 119

  24. [24]

    Making the v in vqa matter: Elevating 9 the role of image understanding in visual question answer- ing

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating 9 the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 2, 3

  25. [25]

    Mea- suring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In Inter- national Conference on Learning Representations, 2020. 2

  26. [26]

    Sparkles: Unlocking chats across multiple images for multimodal instruction-following mod- els

    Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Col- lier Nigel, and Yutong Lu. Sparkles: Unlocking chats across multiple images for multimodal instruction-following mod- els. arXiv preprint arXiv:2308.16463, 2023. 3

  27. [27]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 3

  28. [28]

    Revolutionizing the future with hyper generative ai

    HyperGAI. Revolutionizing the future with hyper generative ai. 2024. 15, 16, 17, 18, 19, 20, 21

  29. [29]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

  30. [30]

    Referitgame: Referring to objects in pho- tographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 2

  31. [31]

    Agi and aigc business skywork

    Kunlun. Agi and aigc business skywork. 2024. 15, 16, 17, 18, 19, 20, 21

  32. [32]

    Artificial general intelligence (agi) for edu- cation

    Ehsan Latif, Gengchen Mai, Matthew Nyaaba, Xuansheng Wu, Ninghao Liu, Guoyu Lu, Sheng Li, Tianming Liu, and Xiaoming Zhai. Artificial general intelligence (agi) for edu- cation. arXiv preprint arXiv:2304.12479, 2023. 1

  33. [33]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 2, 3

  34. [34]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 3, 5, 15, 16, 17, 18, 19, 20, 21

  35. [35]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Inter- national Conference on Machine Learning, 2023. 2, 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21

  36. [36]

    M3it: A large-scale dataset towards multi- modal multilingual instruction tuning

    Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi- modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023. 3

  37. [37]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16 , pages 121–137. Springer,

  38. [38]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 3

  39. [39]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023. 6, 15, 16, 17, 18, 19, 20, 21

  40. [40]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 2, 3

  41. [41]

    Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 15, 16, 17, 18, 19, 20, 21

  42. [42]

    Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusion- bench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt- 4v (ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023. 3

  43. [43]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 3

  44. [44]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 2, 3, 5, 7, 15, 16, 17, 18, 19, 20, 21

  45. [45]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

  46. [46]

    Llava-next: Improved reasoning, ocr, and world knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge. 2024. 6, 15, 16, 17, 18, 19, 20, 21

  47. [47]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 2, 3

  48. [48]

    On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 3

  49. [49]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019. 3

  50. [50]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,

  51. [51]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 3

  52. [52]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 3

  53. [53]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023. 1, 3

  54. [54]

    Minicpm-v

    MiniCPM. Minicpm-v. https://github.com/OpenBMB/MiniCPM, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

  55. [55]

    Minicpm-v-2, 2024

    MiniCPM. Minicpm-v-2, 2024. 15, 16, 17, 18, 19, 20, 21

  56. [56]

    Metavl: Transferring in-context learning ability from language models to vision-language models

    Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F Yang, and Kai-Wei Chang. Metavl: Transferring in-context learning ability from language models to vision-language models. arXiv preprint arXiv:2306.01311, 2023. 3

  57. [57]

    Levels of agi: Operationalizing progress on the path to agi

    Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023. 1, 3, 8

  58. [58]

    Ominilmm-12b

    OminiLMM. Ominilmm-12b. https://github.com/OpenBMB/OmniLMM, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

  59. [59]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 6, 15, 16, 17, 18, 19, 20, 21

  60. [60]

    Gpt-4v(ision) system card, 2023

    OpenAI. Gpt-4v(ision) system card, 2023. 2, 6, 7, 15, 16, 17, 18, 19, 20, 21

  61. [61]

    OpenAI. Gpt-4o. 2024. 6, 15, 119

  62. [62]

    Reka core, flash, and edge: A series of powerful multimodal language models

    Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. https://publications.reka.ai/reka-core-tech-report.pdf, 2024. 15, 16, 17, 18, 19, 20, 21, 119

  63. [63]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023. 5, 6, 15, 16, 17, 18, 19, 20, 21

  64. [64]

    Qwen-vl-plus

    Qwen. Qwen-vl-plus. https://github.com/QwenLM/Qwen-VL?tab=readme-ov-file#qwen-vl-plus, 2023. GitHub Repository. 15, 16, 17, 18, 19, 20, 21

  65. [65]

    Qwen-vl-max

    Qwen. Qwen-vl-max. https://github.com/QwenLM/Qwen-VL?tab=readme-ov-file#qwen-vl-max, 2024. GitHub Repository. 6, 15, 16, 17, 18, 19, 20, 21

  66. [66]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5

  67. [67]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. 3

  68. [68]

    Sensechat-vision, 2024

    sensenova. Sensechat-vision, 2024. 6, 15, 16, 17, 18, 19, 20, 21

  69. [69]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 2

  70. [70]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023. 15, 16, 17, 18, 19, 20, 21

  71. [71]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, 2019. 3

  72. [72]

    Introducing the next generation of claude

    Claude Team. Introducing the next generation of claude. https://www.anthropic.com/news/claude-3-family, 2024. 6, 15, 119

  73. [73]

    Infimm: Advancing multimodal understanding from flamingo’s legacy through diverse llm integration,

    InfiMM Team. Infimm: Advancing multimodal understanding from flamingo’s legacy through diverse llm integration,

  74. [74]

    15, 16, 17, 18, 19, 20, 21

  75. [75]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 5

  76. [76]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 6, 15, 16, 17, 18, 19, 20, 21

  77. [77]

    Evaluation and analysis of hallucination in large vision-language models

    Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023. 3

  78. [78]

    Cogvlm: Visual expert for pretrained language models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023. 2, 5, 6, 15, 16, 17, 18, 19, 20, 21

  79. [79]

    Simvlm: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2021. 3

  80. [80]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023. 3

Showing first 80 references.