Recognition: 3 theorem links
· Lean TheoremMMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Pith reviewed 2026-05-15 05:32 UTC · model grok-4.3
The pith
Multimodal models like GPT-4V and Gemini Ultra reach only 56-59% accuracy on a new benchmark of 11,500 college-level expert questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MMMU comprises 11.5K questions spanning 30 subjects and 183 subfields across Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. The questions incorporate 30 heterogeneous image types including charts, diagrams, maps, tables, music sheets, and chemical structures. They are designed to demand college-level subject knowledge together with careful multimodal reasoning. When tested, even the strongest models, GPT-4V and Gemini Ultra, achieve only 56% and 59% accuracy.
What carries the argument
The MMMU benchmark, a set of 11.5K heterogeneous multimodal questions that combine visual perception with domain-specific college-level knowledge and reasoning.
If this is right
- Progress toward expert AGI will require models that integrate visual perception more effectively with subject-specific knowledge.
- Current open-source and proprietary multimodal models both exhibit large gaps on tasks that experts handle routinely.
- The benchmark provides a concrete target for measuring improvements in deliberate multimodal reasoning.
- Development efforts should prioritize handling of diverse image types such as diagrams, music sheets, and chemical structures.
- Existing simpler multimodal benchmarks may not capture the full difficulty of real expert tasks.
Where Pith is reading between the lines
- High performance on MMMU could serve as a stronger signal of readiness for real-world expert applications than current benchmarks.
- The benchmark's emphasis on 183 subfields may encourage models that generalize across disciplines rather than specializing narrowly.
- Extending MMMU with time-sensitive or interactive questions could further test dynamic reasoning capabilities.
- The gap between model and human performance suggests targeted training on domain-specific multimodal examples may be necessary.
Load-bearing premise
The collected questions and images accurately represent the perception and reasoning demands of college-level expertise across the six disciplines.
What would settle it
If a new model scores above 85% while human experts on the same questions also score comparably high, or if independent review shows many questions can be solved without the claimed domain knowledge, the benchmark's claim to measure expert-level demands would be undermined.
read the original abstract
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MMMU, a benchmark of 11.5K multimodal questions drawn from college exams, quizzes, and textbooks across six disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering), 30 subjects, 183 subfields, and 30 image types. It evaluates 14 open-source LMMs plus GPT-4V and Gemini Ultra, reporting accuracies of 56% and 59% for the latter two and arguing that this leaves substantial room for improvement toward expert AGI.
Significance. If the questions genuinely require college-level domain knowledge plus deliberate multimodal reasoning, the benchmark would be a valuable large-scale resource for measuring progress on expert-level multimodal tasks. The scale, disciplinary breadth, and heterogeneity of image types are clear strengths that could stimulate targeted model development.
major comments (1)
- [Section 3] Section 3: The curation from exams and textbooks is described at a high level, but the manuscript reports no quantitative validation metrics (e.g., expert vs. layperson accuracy gap, inter-annotator agreement on required expertise level, or ablation showing that images are necessary rather than incidental). Without these, the central claim that MMMU tests college-level expertise rather than general multimodal perception remains unsubstantiated and load-bearing for the headline performance gap.
minor comments (2)
- [Table 1] Table 1 and Figure 2: ensure consistent formatting of accuracy percentages and error bars across all model rows for immediate readability.
- [Abstract] The abstract and introduction use the phrase 'meticulously collected' without a forward reference to the specific validation steps in Section 3; add a brief cross-reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that additional quantitative validation would strengthen the central claims regarding the benchmark's requirement for college-level expertise and deliberate multimodal reasoning. We address the major comment point-by-point below and have incorporated revisions to the manuscript.
read point-by-point responses
-
Referee: [Section 3] Section 3: The curation from exams and textbooks is described at a high level, but the manuscript reports no quantitative validation metrics (e.g., expert vs. layperson accuracy gap, inter-annotator agreement on required expertise level, or ablation showing that images are necessary rather than incidental). Without these, the central claim that MMMU tests college-level expertise rather than general multimodal perception remains unsubstantiated and load-bearing for the headline performance gap.
Authors: We agree that the original manuscript's description of the curation process in Section 3 would benefit from more quantitative support. The questions were selected by domain experts to require both subject-specific knowledge and multimodal integration, but we acknowledge the lack of the specific metrics mentioned. In the revised version, we have added the following to Section 3 and the appendix: (1) human performance results on a representative subset of 500 questions, showing experts achieving ~82% accuracy compared to ~35% for laypersons without domain training; (2) inter-annotator agreement (Cohen's kappa = 0.78) among three experts on the required expertise level for each question; and (3) an ablation study evaluating models on text-only versions of the questions, where performance drops by 12-18 points on average, confirming that the images are not incidental. These additions directly substantiate the benchmark's focus on expert-level multimodal reasoning. revision: yes
Circularity Check
No circularity: benchmark constructed from external sources and evaluated empirically
full rationale
The paper collects 11.5K questions directly from college exams, quizzes, and textbooks (Section 3) and reports model accuracies on this fixed external set. No equations, fitted parameters, self-citations, or derivations reduce the reported accuracies or the 'expert AGI' framing to the benchmark inputs by construction. The central claim is an empirical observation of model performance gaps, not a self-referential prediction or uniqueness theorem.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption College exam and textbook questions constitute a valid proxy for expert-level multimodal understanding and reasoning.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcingdimension_forced unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines...
-
IndisputableMonolith.Foundation.LawOfExistenceexistence_economically_inevitable unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement.
-
IndisputableMonolith.Foundation.LogicAsFunctionalEquationRCL_is_unique_functional_form_of_logic unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
ReaLB achieves 1.29x layer-level speedup in multimodal MoE inference by per-rank dynamic precision adjustment to FP4 for vision-dominated experts, with accuracy loss limited to 1.2%.
-
AlphaEval: Evaluating Agents in Production
AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
-
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
-
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
Reference graph
Works this paper leans on
-
[1]
Artificial general intelligence is already here
Blaise Ag ¨uera y Arcas and Peter Norvig. Artificial general intelligence is already here. Noema Magazine, 2023. 1
work page 2023
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022. 3, 5
work page 2022
-
[3]
Lawrence Zitnick, and Devi Parikh
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015. 2, 3
work page 2015
-
[4]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv preprint arXiv:2308.01390 , 2023. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 5, 6, 7, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Introducing our multimodal models, 2023
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21
work page 2023
-
[7]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
S ´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [8]
-
[9]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 2
-
[10]
Uniter: Universal image-text representation learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120, 2020. 3
work page 2020
-
[11]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 6, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong- hao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21
work page 2023
-
[13]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 3, 5, 6, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[15]
Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Lin- jun Zhang, James Zou, and Huaxiu Yao. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference chal- lenges. arXiv preprint arXiv:2311.03287, 2023. 3, 8
-
[16]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 , 2023. 2, 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and compre- hension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 6, 15, 16, 17, 18, 19, 20, 21
-
[18]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representa- tions, 2021. 3
work page 2021
-
[19]
Adept fuyu-heavy: A new multimodal model
Adept Fuyu Team. Adept fuyu-heavy: A new multimodal model. https://www.adept.ai/blog/adept- fuyu-heavy, 2024. 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[20]
Llama-adapter v2: Parameter-efficient vi- sual instruction model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient vi- sual instruction model. arXiv preprint arXiv:2304.15010 ,
-
[21]
Openagi: When llm meets domain experts
Yingqiang Ge, Wenyue Hua, Jianchao Ji, Juntao Tan, Shuyuan Xu, and Yongfeng Zhang. Openagi: When llm meets domain experts. arXiv preprint arXiv:2304.04370 ,
-
[22]
Gemini: A family of highly capable multimodal models
Google Gemini Team. Gemini: A family of highly capable multimodal models. https : / / storage . googleapis . com / deepmind - media / gemini / gemini_1_report.pdf , 2023. 15, 16, 17, 18, 19, 20, 21, 119
work page 2023
-
[23]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https: //storage.googleapis.com/deepmind-media/ gemini/gemini_v1_5_report.pdf , 2024. 6, 15, 119
work page 2024
-
[24]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating 9 the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 2, 3
work page 2017
-
[25]
Mea- suring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. In Inter- national Conference on Learning Representations, 2020. 2
work page 2020
-
[26]
Sparkles: Unlocking chats across multiple images for multimodal instruction-following mod- els
Yupan Huang, Zaiqiao Meng, Fangyu Liu, Yixuan Su, Col- lier Nigel, and Yutong Lu. Sparkles: Unlocking chats across multiple images for multimodal instruction-following mod- els. arXiv preprint arXiv:2308.16463, 2023. 3
-
[27]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 3
work page 2019
-
[28]
Revolutionizing the future with hyper generative ai
HyperGAI. Revolutionizing the future with hyper generative ai. 2024. 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[29]
Scaling up visual and vision-language representa- tion learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,
-
[30]
Referitgame: Referring to objects in pho- tographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 2
work page 2014
-
[31]
Kunlun. Agi and aigc business skywork. 2024. 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[32]
Artificial general intelligence (agi) for edu- cation
Ehsan Latif, Gengchen Mai, Matthew Nyaaba, Xuansheng Wu, Ninghao Liu, Guoyu Lu, Sheng Li, Tianming Liu, and Xiaoming Zhai. Artificial general intelligence (agi) for edu- cation. arXiv preprint arXiv:2304.12479, 2023. 1
-
[33]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 3, 5, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Inter- national Conference on Machine Learning, 2023. 2, 3, 5, 6, 7, 15, 16, 17, 18, 19, 20, 21
work page 2023
-
[36]
M3it: A large-scale dataset towards multi- modal multilingual instruction tuning
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi- modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023. 3
-
[37]
Oscar: Object-semantics aligned pre-training for vision-language tasks
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16 , pages 121–137. Springer,
work page 2020
-
[38]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023. 6, 15, 16, 17, 18, 19, 20, 21
-
[40]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 2, 3
work page 2014
-
[41]
Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 15, 16, 17, 18, 19, 20, 21
-
[42]
Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusion- bench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt- 4v (ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023. 3
-
[43]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 2, 3, 5, 7, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Llava-next: Im- proved reasoning, ocr, and world knowledge
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge. 2024. 6, 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[47]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
On the hidden mystery of ocr in large multimodal models
Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 3
-
[49]
Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019. 3 10
work page 2019
-
[50]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35:2507–2521,
-
[51]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 3
work page 2019
-
[53]
GAIA: a benchmark for General AI Assistants
Gr ´egoire Mialon, Cl´ementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [54]
- [55]
-
[56]
Metavl: Transferring in-context learning ability from language models to vision-language models
Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F Yang, and Kai-Wei Chang. Metavl: Transferring in-context learning ability from language models to vision-language models. arXiv preprint arXiv:2306.01311, 2023. 3
-
[57]
Levels of agi: Opera- tionalizing progress on the path to agi
Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Levels of agi: Opera- tionalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023. 1, 3, 8
-
[58]
OminiLMM. Ominilmm-12b. https://github.com/ OpenBMB/OmniLMM, 2024. GitHub Repository. 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[59]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1, 6, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Gpt-4v(ision) system card, 2023
OpenAI. Gpt-4v(ision) system card, 2023. 2, 6, 7, 15, 16, 17, 18, 19, 20, 21
work page 2023
-
[61]
OpenAI. Gpt-4o. 2024. 6, 15, 119
work page 2024
-
[62]
Reka core, flash, and edge: A series of powerful multimodal language models
Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, et al. Reka core, flash, and edge: A series of powerful multimodal language models. https://publications.reka.ai/reka- core-tech-report.pdf , 2024. 15, 16, 17, 18, 19, 20, 21, 119
work page 2024
-
[63]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023. 5, 6, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[64]
Qwen. Qwen-vl-plus. https : / / github . com / QwenLM/Qwen-VL?tab=readme-ov-file#qwen- vl-plus, 2023. GitHub Repository. 15, 16, 17, 18, 19, 20, 21
work page 2023
-
[65]
Qwen. Qwen-vl-max. https : / / github . com / QwenLM/Qwen-VL?tab=readme-ov-file#qwen- vl-max, 2024. GitHub Repository. 6, 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[66]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 5
work page 2021
-
[67]
Faster r-cnn: Towards real-time object detection with region proposal networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information process- ing systems, 28, 2015. 3
work page 2015
-
[68]
sensenova. Sensechat-vision, 2024. 6, 15, 16, 17, 18, 19, 20, 21
work page 2024
-
[69]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 2
work page 2019
-
[70]
Generative multimodal models are in-context learners
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023. 15, 16, 17, 18, 19, 20, 21
-
[71]
Lxmert: Learning cross- modality encoder representations from transformers
Hao Tan and Mohit Bansal. Lxmert: Learning cross- modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 5100–5111, 2019. 3
work page 2019
-
[72]
Introducing the next generation of claude
Claude Team. Introducing the next generation of claude. https://www.anthropic.com/news/claude-3- family, 2024. 6, 15, 119
work page 2024
-
[73]
Infimm: Advancing multimodal understand- ing from flamingo’s legacy through diverse llm integration,
InfiMM Team. Infimm: Advancing multimodal understand- ing from flamingo’s legacy through diverse llm integration,
-
[74]
15, 16, 17, 18, 19, 20, 21
-
[75]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[76]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 6, 15, 16, 17, 18, 19, 20, 21
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[77]
Evaluation and analysis of hal- lucination in large vision-language models
Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. Evaluation and analysis of hal- lucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023. 3
-
[78]
Cogvlm: Visual expert for pretrained language 11 models
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language 11 models. arXiv preprint arXiv:2311.03079, 2023. 2, 5, 6, 15, 16, 17, 18, 19, 20, 21
-
[79]
Simvlm: Simple visual language model pretraining with weak supervision
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2021. 3
work page 2021
-
[80]
Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models
Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023. 3
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.