OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models
Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3
The pith
Current top vision-language models achieve only about 50% accuracy on Olympiad problems that require reasoning across multiple images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors create OMIBench, a benchmark of Olympiad-level problems from four scientific fields that require multi-image reasoning, complete with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. They report that leading LVLMs, including Gemini-3-Pro, attain only about 50% accuracy, exposing gaps in current systems' ability to integrate distributed visual evidence.
What carries the argument
OMIBench, a dataset of multi-image Olympiad problems accompanied by annotated rationales and protocols for both exact and semantic answer evaluation.
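For concreteness, here is a hypothetical record layout consistent with that description; the field names and types are reconstructed from the abstract and are assumptions, not the authors' published schema.

```python
# Hypothetical record layout for one OMIBench problem, reconstructed from
# the abstract; field names and types are assumptions, not the authors' schema.
from dataclasses import dataclass

@dataclass
class OMIBenchItem:
    problem_id: str
    domain: str              # "biology" | "chemistry" | "mathematics" | "physics"
    question: str
    image_paths: list[str]   # evidence distributed across two or more images
    answer: str              # gold answer, checked by exact or semantic matching
    rationale: str           # manually annotated reasoning chain
```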
If this is right
- Leading LVLMs show clear performance shortfalls on tasks needing evidence from multiple images.
- Gemini-3-Pro and similar models reach only approximately 50% accuracy on the benchmark.
- The benchmark provides tools for researchers to measure and improve multi-image reasoning capabilities.
- Evaluation can use either strict exact matching or more flexible semantic matching of answers; a minimal sketch of both modes follows this list.
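Both modes lend themselves to a short sketch. The normalization rules and the 0.85 threshold below are illustrative assumptions; the abstract does not specify the paper's actual semantic-matching protocol, and a real semantic judge would likely be model-based rather than string-similarity-based.

```python
# Illustrative exact vs. semantic matching; normalization rules and the 0.85
# threshold are assumptions, not the paper's protocol.
import re
from difflib import SequenceMatcher

def normalize(ans: str) -> str:
    """Lowercase and collapse whitespace/punctuation so ' 42. ' matches '42'."""
    return re.sub(r"[\s.,;:]+", " ", ans.lower()).strip()

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def semantic_match(pred: str, gold: str, threshold: float = 0.85) -> bool:
    """String-similarity stand-in for a (likely model-based) semantic judge."""
    return SequenceMatcher(None, normalize(pred), normalize(gold)).ratio() >= threshold

assert exact_match(" 42. ", "42")
assert semantic_match("the mitochondrion", "mitochondrion")
```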
Where Pith is reading between the lines
- Future model designs may need explicit mechanisms to link and reason over separate images rather than processing them independently.
- Existing single-image benchmarks could be overestimating real capabilities for complex, distributed visual tasks.
- Training data that splits related information across images might help close the observed gaps.
- Such benchmarks could prove useful in other fields involving multiple visual inputs, like interpreting sets of scientific figures.
Load-bearing premise
The chosen Olympiad problems together with their manually annotated rationales represent the true demands of multi-image reasoning in real Olympiad settings.
What would settle it
Finding that several state-of-the-art LVLMs achieve substantially higher than 50% accuracy on OMIBench, say above 75%, would cast doubt on the extent of the reported reasoning limitations.
Original abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OMIBench, a benchmark for Olympiad-level multi-image reasoning in LVLMs drawn from biology, chemistry, mathematics, and physics problems. It supplies manually annotated rationales and protocols for exact and semantic answer matching. Experiments across multiple LVLMs report performance gaps, with the strongest model (Gemini-3-Pro) reaching only ~50% accuracy, and position the benchmark as a resource for studying distributed visual evidence in complex reasoning.
Significance. If the problems are shown to require cross-image integration, OMIBench would provide a useful diagnostic resource for an under-tested capability in current LVLMs. The manual rationales could support targeted error analysis, and the multi-domain coverage adds breadth. The work does not include machine-checked elements or parameter-free derivations but offers a concrete empirical testbed.
major comments (3)
- [§3] §3 (Benchmark Construction): No single-image or text-only ablations are reported, nor is there a count of problems solvable from any single image or explicit verification that rationales cite cross-image dependencies. This is load-bearing for interpreting the ~50% Gemini-3-Pro result as evidence of a multi-image reasoning gap rather than general Olympiad hardness.
- [§3.2] §3.2 (Annotation and Validation): The manuscript provides no inter-annotator agreement statistics, no protocol for confirming that annotations isolate multi-image requirements, and no details on data selection criteria to ensure problems cannot be solved without all images. These omissions weaken the central claim that OMIBench specifically measures multi-image reasoning.
- [§4] §4 (Experiments and Results): The reported model accuracies lack per-domain breakdowns, variance estimates, statistical significance tests, or analysis of failure modes tied to image distribution. Without these, the performance gaps cannot be rigorously attributed to the multi-image aspect highlighted in the abstract.
minor comments (2)
- The abstract and introduction could more precisely state the total number of problems, the distribution across domains, and the exact evaluation metrics used for semantic matching.
- Figure captions and table headers would benefit from explicit definitions of the exact vs. semantic matching protocols to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We agree that additional analyses are needed to more rigorously establish that OMIBench isolates multi-image reasoning capabilities. We address each major comment below and will incorporate the suggested revisions to strengthen the paper.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): No single-image or text-only ablations are reported, nor is there a count of problems solvable from any single image or explicit verification that rationales cite cross-image dependencies. This is load-bearing for interpreting the ~50% Gemini-3-Pro result as evidence of a multi-image reasoning gap rather than general Olympiad hardness.
Authors: We agree that ablations are essential to substantiate the multi-image focus. In the revised manuscript, we will add single-image and text-only baselines on a representative subset of problems. We will also report the number of problems that, per the annotated rationales, require evidence from multiple images and confirm that the rationales explicitly reference cross-image dependencies. These additions will help differentiate multi-image reasoning challenges from general Olympiad difficulty. revision: yes
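To make that commitment concrete, a minimal harness for the promised ablations might look like the sketch below, where `model_answer` and `is_correct` are hypothetical stand-ins for an LVLM call and an answer checker, not the paper's evaluation code.

```python
# Sketch of the promised ablation: score each problem under three input
# conditions; model_answer and is_correct are hypothetical stand-ins.
from typing import Callable, Iterable

CONDITIONS = {
    "all_images":   lambda item: item.image_paths,      # full multi-image input
    "single_image": lambda item: item.image_paths[:1],  # first image only
    "text_only":    lambda item: [],                    # drop all visual input
}

def run_ablation(items: Iterable, model_answer: Callable, is_correct: Callable) -> dict:
    """Accuracy per condition. A large all_images-vs-single_image gap would
    support reading the ~50% result as a multi-image reasoning deficit."""
    items = list(items)
    scores = {name: 0 for name in CONDITIONS}
    for item in items:
        for name, select in CONDITIONS.items():
            pred = model_answer(item.question, select(item))
            scores[name] += is_correct(pred, item.answer)
    return {name: s / max(len(items), 1) for name, s in scores.items()}
```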
Referee: [§3.2] §3.2 (Annotation and Validation): The manuscript provides no inter-annotator agreement statistics, no protocol for confirming that annotations isolate multi-image requirements, and no details on data selection criteria to ensure problems cannot be solved without all images. These omissions weaken the central claim that OMIBench specifically measures multi-image reasoning.
Authors: We acknowledge the need for greater transparency in the annotation process. Although the rationales were created and cross-checked by domain experts, we will include inter-annotator agreement statistics in the revision. We will also document the protocol used to verify that problems require all provided images and detail the selection criteria that excluded problems solvable from any single image or text alone. revision: yes
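The promised agreement statistic could be as simple as Cohen's kappa over binary requires-all-images labels per problem; the following is an assumed illustration, not the authors' protocol.

```python
# Minimal Cohen's kappa for two annotators giving binary "requires all
# images" labels; an assumed illustration, not the authors' protocol.
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    # Chance agreement under independent labeling with each rater's marginals.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# 8/10 observed agreement over 10 items yields kappa = 0.6 here.
print(cohens_kappa([1, 1, 1, 0, 0, 1, 1, 0, 1, 1],
                   [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]))
```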
Referee: [§4] §4 (Experiments and Results): The reported model accuracies lack per-domain breakdowns, variance estimates, statistical significance tests, or analysis of failure modes tied to image distribution. Without these, the performance gaps cannot be rigorously attributed to the multi-image aspect highlighted in the abstract.
Authors: We will expand the experimental results to include per-domain accuracy breakdowns for all evaluated models. Where multiple runs are feasible, we will report variance estimates and apply statistical significance tests. We will further add a failure-mode analysis that categorizes errors according to whether they arise from difficulties in cross-image integration versus other reasoning limitations. revision: yes
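A minimal version of the promised breakdown and variance estimate, assuming per-problem binary outcomes; the percentile bootstrap and resampling count are illustrative choices, not the authors' method.

```python
# Per-domain accuracy with percentile-bootstrap confidence intervals; the
# resampling count and alpha are illustrative assumptions.
import random
from collections import defaultdict

def per_domain_accuracy(results: list[tuple[str, int]]) -> dict[str, float]:
    """results holds (domain, correct) pairs with correct in {0, 1}."""
    by_domain = defaultdict(list)
    for domain, correct in results:
        by_domain[domain].append(correct)
    return {d: sum(v) / len(v) for d, v in by_domain.items()}

def bootstrap_ci(correct_flags: list[int], n_boot: int = 10_000, alpha: float = 0.05):
    """Resample per-problem outcomes with replacement; return (lo, hi) bounds."""
    n = len(correct_flags)
    means = sorted(sum(random.choices(correct_flags, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```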
Circularity Check
No circularity: benchmark construction is empirical and self-contained
Full rationale
The paper creates OMIBench by curating Olympiad problems from biology, chemistry, mathematics, and physics, supplying manually annotated rationales, and reporting direct empirical accuracy of LVLMs (e.g., Gemini-3-Pro at ~50%). There are no equations, fitted parameters, predictions, or derivations that could reduce to their own inputs by construction, and no load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claim is an observed performance gap on the new dataset, measured externally rather than derived from prior fitted quantities or self-referential definitions. This is the expected non-circular outcome for a benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Olympiad problems frequently require integrating information distributed across multiple images.
invented entities (1)
- OMIBench (no independent evidence)