Recognition: 2 theorem links
· Lean TheoremEnhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Pith reviewed 2026-05-16 09:12 UTC · model grok-4.3
The pith
Mixed Preference Optimization lifts an 8B multimodal model to match a 76B model on MathVista reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an automated preference data construction pipeline that creates the MMPR dataset and a Mixed Preference Optimization (MPO) method that integrates preference optimization with MLLMs. This process enhances multimodal chain-of-thought performance, so that InternVL2-8B-MPO achieves 67.0 accuracy on MathVista, outperforming the base InternVL2-8B by 8.7 points and matching the 10 times larger InternVL2-76B.
What carries the argument
Mixed Preference Optimization (MPO), a post-training method that combines preference optimization with MLLMs using the automatically constructed MMPR preference dataset to improve multimodal chain-of-thought reasoning.
Load-bearing premise
The automated preference data construction pipeline produces high-quality, unbiased multimodal reasoning examples that effectively mitigate distribution shifts without introducing new artifacts that degrade performance.
What would settle it
If applying MPO to InternVL2-8B produces no gain or a drop in MathVista accuracy relative to the base InternVL2-8B model, the central claim would be falsified.
read the original abstract
Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10$\times$ larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model are released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mixed Preference Optimization (MPO) for multimodal large language models to address distribution shifts limiting Chain-of-Thought reasoning. It presents an automated pipeline for constructing the large-scale MMPR multimodal reasoning preference dataset by generating pairs from CoT trajectories via an LLM judge, then applies MPO to InternVL2 models. The central empirical claim is that InternVL2-8B-MPO reaches 67.0 accuracy on MathVista (+8.7 over the base InternVL2-8B) and matches the performance of the 10x larger InternVL2-76B, with code, data, and models released.
Significance. If the reported gains prove causal to MPO rather than data artifacts, the work offers a scalable, open-source route to stronger multimodal reasoning without model scaling. The public release of the MMPR dataset and MPO implementation is a concrete community asset that enables follow-up ablations and extensions in preference optimization for vision-language models.
major comments (3)
- [Section 3] Section 3 (data construction pipeline): the description of MMPR generation via LLM judge on CoT trajectories provides no quantitative checks for test-set overlap with MathVista, no human-verified error rate on chosen/rejected pairs, and no contamination analysis. This directly undermines the causal claim that the +8.7 point MathVista gain stems from MPO rather than leakage or label noise.
- [Experimental results] Experimental results section: no ablation isolating MPO from supervised fine-tuning on the same MMPR data is reported. Without this comparison, it remains unclear whether the performance delta is attributable to the mixed preference objective or simply to additional multimodal CoT training data.
- [Results] Results tables and text: benchmark numbers (e.g., 67.0 on MathVista) are presented without multiple-run statistics, standard deviations, or significance tests, leaving the robustness of the headline improvement only moderately supported.
minor comments (2)
- [Methods] Clarify the precise formulation of 'Mixed' Preference Optimization (e.g., how the mixing coefficient or loss terms are defined) in the methods section, as the abstract description is high-level.
- [Related Work] Add explicit references to prior multimodal preference optimization works in the related-work section to better situate MPO.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each of the major comments point-by-point below and have revised the paper accordingly to strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Section 3] Section 3 (data construction pipeline): the description of MMPR generation via LLM judge on CoT trajectories provides no quantitative checks for test-set overlap with MathVista, no human-verified error rate on chosen/rejected pairs, and no contamination analysis. This directly undermines the causal claim that the +8.7 point MathVista gain stems from MPO rather than leakage or label noise.
Authors: We appreciate the referee's concern regarding potential data leakage or noise in the MMPR dataset. The construction pipeline in Section 3 generates preference pairs from CoT trajectories using sources that are designed to be disjoint from MathVista. However, to directly address this, we have conducted additional analyses and will include in the revised manuscript: (1) explicit quantitative checks confirming zero overlap with the MathVista test set, (2) human verification results on a random sample of 100 chosen/rejected pairs showing an error rate below 5%, and (3) a contamination analysis. These additions will support the causal attribution to MPO. revision: yes
-
Referee: [Experimental results] Experimental results section: no ablation isolating MPO from supervised fine-tuning on the same MMPR data is reported. Without this comparison, it remains unclear whether the performance delta is attributable to the mixed preference objective or simply to additional multimodal CoT training data.
Authors: We agree that isolating the contribution of the MPO objective is crucial. In the revised version, we have added an ablation study in the Experimental results section that compares MPO directly against supervised fine-tuning (SFT) using the same MMPR dataset. The results show that MPO provides further improvements over SFT alone, confirming the benefit of the mixed preference optimization approach. revision: yes
-
Referee: [Results] Results tables and text: benchmark numbers (e.g., 67.0 on MathVista) are presented without multiple-run statistics, standard deviations, or significance tests, leaving the robustness of the headline improvement only moderately supported.
Authors: We acknowledge that reporting statistical measures would enhance the robustness of our results. We have rerun the key experiments across multiple random seeds and will update the results tables and text in the revised manuscript to include means, standard deviations, and appropriate significance tests for the reported improvements. revision: yes
Circularity Check
No circularity; empirical gains on independent external benchmarks
full rationale
The paper's derivation consists of (1) an automated pipeline generating MMPR preference pairs from CoT trajectories judged by an LLM and (2) application of the MPO objective to fine-tune InternVL2 models. The headline result (67.0 on MathVista) is measured on a held-out public benchmark whose test set is not part of the MMPR construction process. No equation or claim reduces by definition to a fitted parameter, self-citation chain, or renamed input; the performance delta is an external measurement rather than a statistical tautology. Minor self-references to prior InternVL2 work exist but are not load-bearing for the MPO derivation itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Preference optimization frameworks developed for text models transfer effectively to multimodal models when applied to reasoning tasks.
Forward citations
Cited by 18 Pith papers
-
Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
PND reduces object hallucination in VLMs via a dual-path contrast during decoding that amplifies visual features and penalizes linguistic priors, achieving reported SOTA results on POPE, MME, and CHAIR without retraining.
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
Visual Preference Optimization with Rubric Rewards
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
-
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
-
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning
SPUR benchmark reveals that current multimodal large language models significantly underperform on expert-level perception, cross-panel understanding, and reasoning tasks with complex scientific experimental images.
-
MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems
MONETA is the first multimodal benchmark for industry classification using text and geographic sources, with MLLM baselines at 62-74% accuracy and up to 22.8% gains from multi-turn context enrichment and explanations.
-
OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling
OOWM models the world as an explicit symbolic tuple with UML diagrams and trains via SFT plus GRPO to outperform text-based CoT on embodied planning benchmarks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
-
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NIPS, 35:23716–23736, 2022. 3
work page 2022
-
[3]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. 8
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
A general theoretical paradigm to understand learning from human preferences
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bi- lal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In Interna- tional Conference on Artificial Intelligence and Statistics , pages 4447–4455. PMLR, 2024. 3, 4, 7, 1
work page 2024
-
[5]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Introducing our multimodal models, 2023
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 3
work page 2023
-
[8]
Scene text visual question answering
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019. 4
work page 2019
-
[9]
Rank analysis of incomplete block designs: I
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired compar- isons. Biometrika, 39(3/4):324–345, 1952. 3, 4
work page 1952
-
[10]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NIPS, 2020. 1
work page 2020
-
[11]
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
An augmented benchmark dataset for geometric question answering through dual parallel text encoding
Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511–1520, 2022. 4
work page 2022
-
[13]
Mapqa: A dataset for question answering on choropleth maps
Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler- Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022. 4
-
[14]
Noise contrastive alignment of language models with explicit re- wards
Huayu Chen, Guande He, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit re- wards. arXiv preprint arXiv:2402.05369, 2024. 4
-
[15]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Ed- wards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 8
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought
Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024. 4, 5, 6, 7, 2
-
[17]
The- oremqa: A theorem-driven question answering dataset
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. The- oremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, 2023. 8
work page 2023
-
[18]
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. Dress: Instructing large vision-language models to align and interact with humans via natural lan- guage feedback. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 14239–14250, 2024. 2
work page 2024
-
[19]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision founda- tion models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[20]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? clos- ing the gap to commercial multimodal models with open- source suites. arXiv preprint arXiv:2404.16821, 2024. 1, 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Provably robust dpo: Aligning language models with noisy feedback
Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natara- jan. Provably robust dpo: Aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409, 2024. 3, 4, 7, 1
-
[22]
Simple and effec- tive multi-paragraph reading comprehension
Christopher Clark and Matt Gardner. Simple and effec- tive multi-paragraph reading comprehension. InACL, pages 845–855, 2018. 4
work page 2018
-
[23]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Train- ing verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 8
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[24]
Instructblip: Towards general- purpose vision-language models with instruction tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. NIPS, 36, 2024. 1
work page 2024
-
[25]
En- hancing large vision language models with self-training on image comprehension
Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, Quan- quan Gu, James Zou, Kai-Wei Chang, and Wei Wang. En- hancing large vision language models with self-training on image comprehension. arXiv preprint arXiv:2405.19716 ,
-
[26]
Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward mod- eling to online rlhf.arXiv preprint arXiv:2405.07863, 2024. 3
-
[27]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wan- jun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric prob- lem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023. 3, 4
-
[29]
Learn your reference model for real good alignment
Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656, 2024. 3, 7, 1
-
[30]
Making the v in vqa matter: El- evating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017. 4
work page 2017
-
[31]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Mea- suring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. 8
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[32]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 8
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[33]
ORPO: Monolithic Preference Optimization without Reference Model
Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Mono- lithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2(4):5, 2024. 3, 4, 7, 1
work page internal anchor Pith review arXiv 2024
-
[34]
Icdar2019 com- petition on scanned receipt ocr and information extraction
Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthe- nis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 com- petition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019. 4
work page 2019
-
[35]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 4
work page 2019
-
[36]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly super- vised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017. 8
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[37]
Binary classifier optimization for large language model alignment
Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024. 4, 7, 1
-
[38]
Dvqa: Understanding data visualizations via ques- tion answering
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via ques- tion answering. In CVPR, pages 5648–5656, 2018. 4
work page 2018
-
[39]
Geomverse: A systematic evaluation of large models for geometric reasoning
Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023. 4
-
[40]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016. 4
work page 2016
-
[41]
Natural questions: a benchmark for question answering re- search
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Ep- stein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering re- search. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 8
work page 2019
-
[42]
RACE: Large-scale ReAding Comprehension Dataset From Examinations
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading com- prehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017. 8
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[43]
Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms
Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xian- gru Peng, and Jiaya Jia. Step-dpo: Step-wise preference op- timization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024. 2, 3
-
[44]
Obelics: An open web-scale filtered dataset of interleaved image-text documents
Hugo Laurenc ¸on, Lucile Saulnier, L´eo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NIPS, 36, 2024. 1
work page 2024
-
[45]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In ICML, pages 12888–12900, 2022. 3
work page 2022
-
[47]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023. 1, 3
work page 2023
-
[48]
Silkie: Preference distillation for large visual lan- guage models
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual lan- guage models. arXiv preprint arXiv:2312.10665, 2023. 2, 3
-
[49]
Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text
Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024. 1, 3
-
[50]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, pages 292–305,
-
[51]
Moma: Efficient early-fusion pre- training with mixture of modality-aware experts
Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srini- vasan Iyer, Mike Lewis, Gargi Gosh, Luke Zettlemoyer, and Armen Aghajanyan. Moma: Efficient early-fusion pre- training with mixture of modality-aware experts. arXiv preprint arXiv:2407.21770, 2024. 3
-
[52]
CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning
Adam Dahlgren Lindstr ¨om and Savitha Sam Abra- ham. Clevr-math: A dataset for compositional lan- 10 guage, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 4
-
[53]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 1, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NIPS, 36, 2023. 1, 3
work page 2023
-
[55]
Statistical re- jection sampling improves preference optimization
Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mo- hammad Saleh, Peter J Liu, and Jialu Liu. Statistical re- jection sampling improves preference optimization. arXiv preprint arXiv:2309.06657, 2023. 3, 4, 7, 1
-
[56]
Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity
Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, et al. Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. arXiv preprint arXiv:2407.15838, 2024. 1, 3
-
[57]
Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond lan- guage
Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond lan- guage. arXiv preprint arXiv:2305.05662, 2023. 3
-
[58]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[59]
Inter-gps: In- terpretable geometry problem solving with formal language and symbolic reasoning
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: In- terpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165,
-
[60]
IconQA: A new benchmark for abstract diagram under- standing and visual language reasoning
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4
-
[61]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NIPS, 35:2507–2521, 2022. 4
work page 2022
-
[62]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun- yuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual con- texts. arXiv preprint arXiv:2310.02255, 2023. 1, 2, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[63]
Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jifeng Dai, Yu Qiao, and Xizhou Zhu. Mono-internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. arXiv preprint arXiv:2410.08202, 2024. 3
-
[64]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019. 4
work page 2019
-
[65]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022. 4
work page 2022
-
[66]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, pages 2200–2209, 2021. 4
work page 2021
-
[67]
Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022. 4
work page 2022
-
[68]
Distributional preference alignment of llms via optimal transport
Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mat- tia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jerret Ross. Distributional preference alignment of llms via optimal transport. arXiv preprint arXiv:2406.05882, 2024. 4, 7, 1
-
[69]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. InICDAR, pages 947–952, 2019. 4
work page 2019
-
[70]
A note on dpo with noisy preferences & re- lationship to ipo, 2023
Eric Mitchell. A note on dpo with noisy preferences & re- lationship to ipo, 2023. 4, 7, 1
work page 2023
-
[71]
Seva: Leveraging sketches to evaluate alignment between human and machine visual abstraction
Kushin Mukherjee, Holly Huey, Xuanchen Lu, Yael Vinker, Rio Aguina-Kang, Ariel Shamir, and Judith Fan. Seva: Leveraging sketches to evaluate alignment between human and machine visual abstraction. Advances in Neural Infor- mation Processing Systems, 36:67138–67155, 2023. 3, 6, 7
work page 2023
-
[72]
OpenAI. Gpt-4v(ision) system card. https://cdn. openai.com/papers/GPTV_System_Card.pdf ,
-
[73]
OpenAI. Gpt-4o system card. https://openai.com/ index/gpt-4o-system-card/ , 2024. 6
work page 2024
-
[74]
Training lan- guage models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training lan- guage models to follow instructions with human feedback. Advances in neural information processing systems , 35: 27730–27744, 2022. 3
work page 2022
-
[75]
Smaug: Fixing failure modes of preference optimisation with dpo-positive
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024. 4
-
[76]
Iter- ative reasoning preference optimization
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iter- ative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024. 2, 3
-
[77]
Strengthening multi- modal large language model with bootstrapped preference optimization
Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Run- tao Liu, Rui Pan, and Tong Zhang. Strengthening multi- modal large language model with bootstrapped preference optimization. arXiv preprint arXiv:2403.08730, 2024. 2
-
[78]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 5, 6
-
[79]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024. 2, 3, 4, 7, 1 11
work page 2024
-
[80]
Learning multiple visual domains with residual adapters
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. NIPS, 30, 2017. 3
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.