MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Pith reviewed 2026-05-14 00:45 UTC · model grok-4.3
The pith
Multimodal models score 16.8 to 26.9 percent lower on MMMU-Pro than on the original MMMU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MMMU-Pro refines the original MMMU benchmark through a three-step process: filtering out questions solvable by text-only models, augmenting the candidate options, and introducing a vision-only setting in which questions are embedded within images. Model performance drops by 16.8% to 26.9% relative to MMMU, suggesting that earlier evaluations overestimated multimodal capabilities by permitting non-visual shortcuts.
What carries the argument
The MMMU-Pro benchmark, built by filtering out text-answerable questions, expanding multiple choices, and embedding questions in images for vision-only input.
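The first two construction steps can be read as a small data pipeline. A minimal sketch, with the function names, model interface, and filtering threshold all assumptions rather than the authors' implementation:

```python
import random

def filter_text_solvable(questions, text_models, max_solvers=0):
    """Keep only questions that at most `max_solvers` text-only models
    answer correctly. The paper filters questions answerable by
    text-only models; the exact threshold here is an assumption."""
    kept = []
    for q in questions:
        solvers = sum(
            1 for model in text_models
            if model(q["question"], q["options"]) == q["answer"]
        )
        if solvers <= max_solvers:
            kept.append(q)
    return kept

def augment_options(question, distractor_pool, k=10, rng=None):
    """Expand the candidate options to `k` by sampling plausible
    distractors. The pool and sampling scheme are illustrative."""
    rng = rng or random.Random(0)
    opts = list(question["options"])
    extra = [d for d in distractor_pool
             if d not in opts and d != question["answer"]]
    opts += rng.sample(extra, k - len(opts))
    rng.shuffle(opts)
    return {**question, "options": opts}
```

The vision-only third step would then render each surviving question and its augmented options into the question image before evaluation.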
If this is right
- Multimodal models rely more on text than visual cues in standard benchmarks.
- Chain-of-Thought (CoT) reasoning boosts performance on the harder MMMU-Pro.
- OCR prompts provide little additional benefit.
- Future multimodal research should prioritize integrated visual-textual reasoning.
- The benchmark better mimics real-world scenarios requiring simultaneous seeing and reading.
Where Pith is reading between the lines
- Developers might need to redesign models to handle text embedded in images more effectively.
- Benchmarks should routinely include vision-only modes to prevent text leakage.
- The performance gap highlights a need for better training data that mixes visual and textual elements inseparably.
Load-bearing premise
Questions solvable by text-only models truly require no visual understanding, and embedding questions in images tests integration without new biases or confounds.
What would settle it
A model achieving accuracy on MMMU-Pro comparable to its MMMU score, or maintaining high performance in the vision-only embedded setting without training specifically for it, would indicate that the robustness claim does not hold.
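The settling criterion reduces to comparing paired accuracies on the two benchmarks. A hedged sketch of the per-model drop computation (the model names and scores below are placeholders, not reported results):

```python
def accuracy_drops(mmmu_acc, pro_acc):
    """Return each model's accuracy drop (MMMU minus MMMU-Pro), in
    percentage points, for models evaluated on both benchmarks."""
    return {m: round(mmmu_acc[m] - pro_acc[m], 1)
            for m in mmmu_acc if m in pro_acc}

# Placeholder scores for illustration only (not the paper's numbers).
mmmu = {"model_a": 69.1, "model_b": 58.6}
pro = {"model_a": 51.9, "model_b": 35.3}
drops = accuracy_drops(mmmu, pro)
```

Any model whose drop is near zero, absent training specific to the vision-only setting, would count against the robustness claim.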
Original abstract
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MMMU-Pro, a refined version of the MMMU benchmark for multimodal understanding. It applies a three-step process to the original dataset: (1) filtering questions answerable by text-only models, (2) augmenting candidate options, and (3) embedding questions within images to create a vision-only input setting. Experiments across multiple models report performance drops of 16.8% to 26.9% relative to MMMU, with additional analysis showing minimal impact from OCR prompts and general improvement from Chain-of-Thought reasoning. The work positions MMMU-Pro as a more rigorous test of seamless visual-textual integration.
Significance. If the reported performance drops primarily reflect stricter requirements for multimodal reasoning rather than new perceptual confounds, MMMU-Pro would offer a useful, more challenging benchmark that better approximates real-world scenarios requiring integrated vision and language. The empirical construction is straightforward and the consistent drops across models provide a clear signal of current limitations, though the absence of isolating ablations limits the strength of the causal interpretation.
Major comments (2)
- [three-step process and results] The vision-only embedding step (described in the three-step process and results) is central to the claim that MMMU-Pro better measures true visual-textual integration. However, no ablation is presented that compares model performance on the filtered and augmented questions when text is provided directly versus when it is embedded in images. This leaves open the possibility that the 16.8–26.9% drop arises from vision-encoder difficulties with text legibility, layout, or rendering rather than from the intended reasoning challenge.
- [experiments] The observation that OCR prompts have minimal effect (reported in the experiments) does not isolate whether the base visual extraction step itself is the bottleneck, as it tests only explicit prompting rather than the underlying perceptual capability on the embedded images.
Minor comments (2)
- [methods] Exact criteria and thresholds used for filtering questions answerable by text-only models are not fully detailed, which would aid reproducibility of the exact MMMU-Pro dataset.
- [results] Consider adding error bars or statistical tests for the performance differences across models to strengthen the quantitative claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment point by point below.
Point-by-point responses
-
Referee: The vision-only embedding step (described in the three-step process and results) is central to the claim that MMMU-Pro better measures true visual-textual integration. However, no ablation is presented that compares model performance on the filtered and augmented questions when text is provided directly versus when it is embedded in images. This leaves open the possibility that the 16.8–26.9% drop arises from vision-encoder difficulties with text legibility, layout, or rendering rather than from the intended reasoning challenge.
Authors: We acknowledge that a direct ablation comparing text-provided versus embedded versions of the filtered and augmented questions would strengthen the causal interpretation. However, our OCR prompt experiments provide relevant evidence: explicitly supplying extracted text from the embedded images yields only minimal gains. This indicates the drop is unlikely to stem primarily from legibility or rendering issues, but rather from the demands of integrated reasoning. We will add an expanded discussion of this evidence and its implications in the revised manuscript. revision: yes
-
Referee: The observation that OCR prompts have minimal effect (reported in the experiments) does not isolate whether the base visual extraction step itself is the bottleneck, as it tests only explicit prompting rather than the underlying perceptual capability on the embedded images.
Authors: We agree that OCR prompting does not fully isolate inherent perceptual extraction capabilities independent of prompting. That said, the minimal effect even when text is explicitly provided still highlights that current models struggle with the integrated reasoning task central to MMMU-Pro. This supports the benchmark's utility for evaluating seamless visual-textual integration without requiring further changes to the manuscript. revision: no
Circularity Check
No circularity: empirical benchmark construction
Full rationale
The paper constructs MMMU-Pro via three explicit steps on the prior MMMU dataset—text-only filtering, option augmentation, and vision-only image embedding—then reports empirical accuracy drops on external models. No equations, parameter fitting, derivations, or self-citations reduce any claim to its own inputs by construction. The performance range (16.8–26.9%) is an observed measurement, not a forced prediction or renamed input. The work is self-contained against external model evaluations and contains no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Questions answerable by text-only models do not require visual understanding and can be safely filtered out.
- Domain assumption: Embedding questions inside images cleanly tests integrated seeing-and-reading without introducing unrelated biases.
Forward citations
Cited by 22 Pith papers
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or bea...
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.