OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models
Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3
The pith
Current top vision-language models achieve only about 50% accuracy on Olympiad problems that require reasoning across multiple images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors create OMIBench, a benchmark of Olympiad-level problems from four scientific fields that require multi-image reasoning, complete with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. They report that leading LVLMs, including Gemini-3-Pro, attain only about 50% accuracy, exposing gaps in current systems' ability to integrate distributed visual evidence.
What carries the argument
OMIBench, a dataset of multi-image Olympiad problems accompanied by annotated rationales and protocols for both exact and semantic answer evaluation.
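For concreteness, here is a hypothetical record layout consistent with that description; the field names and types are reconstructed from the abstract and are assumptions, not the authors' published schema.

```python
# Hypothetical record layout for one OMIBench problem, reconstructed from
# the abstract; field names and types are assumptions, not the authors' schema.
from dataclasses import dataclass

@dataclass
class OMIBenchItem:
    problem_id: str
    domain: str              # "biology" | "chemistry" | "mathematics" | "physics"
    question: str
    image_paths: list[str]   # evidence distributed across two or more images
    answer: str              # gold answer, checked by exact or semantic matching
    rationale: str           # manually annotated reasoning chain
```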
If this is right
- Leading LVLMs show clear performance shortfalls on tasks needing evidence from multiple images.
- Gemini-3-Pro and similar models reach only approximately 50% accuracy on the benchmark.
- The benchmark provides tools for researchers to measure and improve multi-image reasoning capabilities.
- Evaluation can use either strict exact matching or more flexible semantic matching of answers; a minimal sketch of both modes follows this list.
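Both modes lend themselves to a short sketch. The normalization rules and the 0.85 threshold below are illustrative assumptions; the abstract does not specify the paper's actual semantic-matching protocol, and a real semantic judge would likely be model-based rather than string-similarity-based.

```python
# Illustrative exact vs. semantic matching; normalization rules and the 0.85
# threshold are assumptions, not the paper's protocol.
import re
from difflib import SequenceMatcher

def normalize(ans: str) -> str:
    """Lowercase and collapse whitespace/punctuation so ' 42. ' matches '42'."""
    return re.sub(r"[\s.,;:]+", " ", ans.lower()).strip()

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def semantic_match(pred: str, gold: str, threshold: float = 0.85) -> bool:
    """String-similarity stand-in for a (likely model-based) semantic judge."""
    return SequenceMatcher(None, normalize(pred), normalize(gold)).ratio() >= threshold

assert exact_match(" 42. ", "42")
assert semantic_match("the mitochondrion", "mitochondrion")
```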
Where Pith is reading between the lines
- Future model designs may need explicit mechanisms to link and reason over separate images rather than processing them independently.
- Existing single-image benchmarks could be overestimating real capabilities for complex, distributed visual tasks.
- Training data that splits related information across images might help close the observed gaps.
- Such benchmarks could prove useful in other fields involving multiple visual inputs, like interpreting sets of scientific figures.
Load-bearing premise
The chosen Olympiad problems together with their manually annotated rationales represent the true demands of multi-image reasoning in real Olympiad settings.
What would settle it
Finding that several state-of-the-art LVLMs achieve substantially higher than 50% accuracy on OMIBench, say above 75%, would cast doubt on the extent of the reported reasoning limitations.
Original abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OMIBench, a benchmark for Olympiad-level multi-image reasoning in LVLMs drawn from biology, chemistry, mathematics, and physics problems. It supplies manually annotated rationales and protocols for exact and semantic answer matching. Experiments across multiple LVLMs report performance gaps, with the strongest model (Gemini-3-Pro) reaching only ~50% accuracy, and position the benchmark as a resource for studying distributed visual evidence in complex reasoning.
Significance. If the problems are shown to require cross-image integration, OMIBench would provide a useful diagnostic resource for an under-tested capability in current LVLMs. The manual rationales could support targeted error analysis, and the multi-domain coverage adds breadth. The work does not include machine-checked elements or parameter-free derivations but offers a concrete empirical testbed.
major comments (3)
- [§3] §3 (Benchmark Construction): No single-image or text-only ablations are reported, nor is there a count of problems solvable from any single image or explicit verification that rationales cite cross-image dependencies. This is load-bearing for interpreting the ~50% Gemini-3-Pro result as evidence of a multi-image reasoning gap rather than general Olympiad hardness.
- [§3.2] §3.2 (Annotation and Validation): The manuscript provides no inter-annotator agreement statistics, no protocol for confirming that annotations isolate multi-image requirements, and no details on data selection criteria to ensure problems cannot be solved without all images. These omissions weaken the central claim that OMIBench specifically measures multi-image reasoning.
- [§4] §4 (Experiments and Results): The reported model accuracies lack per-domain breakdowns, variance estimates, statistical significance tests, or analysis of failure modes tied to image distribution. Without these, the performance gaps cannot be rigorously attributed to the multi-image aspect highlighted in the abstract.
minor comments (2)
- The abstract and introduction could more precisely state the total number of problems, the distribution across domains, and the exact evaluation metrics used for semantic matching.
- Figure captions and table headers would benefit from explicit definitions of the exact vs. semantic matching protocols to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We agree that additional analyses are needed to more rigorously establish that OMIBench isolates multi-image reasoning capabilities. We address each major comment below and will incorporate the suggested revisions to strengthen the paper.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): No single-image or text-only ablations are reported, nor is there a count of problems solvable from any single image or explicit verification that rationales cite cross-image dependencies. This is load-bearing for interpreting the ~50% Gemini-3-Pro result as evidence of a multi-image reasoning gap rather than general Olympiad hardness.
Authors: We agree that ablations are essential to substantiate the multi-image focus. In the revised manuscript, we will add single-image and text-only baselines on a representative subset of problems. We will also report the number of problems that, per the annotated rationales, require evidence from multiple images and confirm that the rationales explicitly reference cross-image dependencies. These additions will help differentiate multi-image reasoning challenges from general Olympiad difficulty. revision: yes
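To make that commitment concrete, a minimal harness for the promised ablations might look like the sketch below, where `model_answer` and `is_correct` are hypothetical stand-ins for an LVLM call and an answer checker, not the paper's evaluation code.

```python
# Sketch of the promised ablation: score each problem under three input
# conditions; model_answer and is_correct are hypothetical stand-ins.
from typing import Callable, Iterable

CONDITIONS = {
    "all_images":   lambda item: item.image_paths,      # full multi-image input
    "single_image": lambda item: item.image_paths[:1],  # first image only
    "text_only":    lambda item: [],                    # drop all visual input
}

def run_ablation(items: Iterable, model_answer: Callable, is_correct: Callable) -> dict:
    """Accuracy per condition. A large all_images-vs-single_image gap would
    support reading the ~50% result as a multi-image reasoning deficit."""
    items = list(items)
    scores = {name: 0 for name in CONDITIONS}
    for item in items:
        for name, select in CONDITIONS.items():
            pred = model_answer(item.question, select(item))
            scores[name] += is_correct(pred, item.answer)
    return {name: s / max(len(items), 1) for name, s in scores.items()}
```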
Referee: [§3.2] §3.2 (Annotation and Validation): The manuscript provides no inter-annotator agreement statistics, no protocol for confirming that annotations isolate multi-image requirements, and no details on data selection criteria to ensure problems cannot be solved without all images. These omissions weaken the central claim that OMIBench specifically measures multi-image reasoning.
Authors: We acknowledge the need for greater transparency in the annotation process. Although the rationales were created and cross-checked by domain experts, we will include inter-annotator agreement statistics in the revision. We will also document the protocol used to verify that problems require all provided images and detail the selection criteria that excluded problems solvable from any single image or text alone. revision: yes
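The promised agreement statistic could be as simple as Cohen's kappa over binary requires-all-images labels per problem; the following is an assumed illustration, not the authors' protocol.

```python
# Minimal Cohen's kappa for two annotators giving binary "requires all
# images" labels; an assumed illustration, not the authors' protocol.
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    # Chance agreement under independent labeling with each rater's marginals.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# 8/10 observed agreement over 10 items yields kappa = 0.6 here.
print(cohens_kappa([1, 1, 1, 0, 0, 1, 1, 0, 1, 1],
                   [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]))
```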
Referee: [§4] §4 (Experiments and Results): The reported model accuracies lack per-domain breakdowns, variance estimates, statistical significance tests, or analysis of failure modes tied to image distribution. Without these, the performance gaps cannot be rigorously attributed to the multi-image aspect highlighted in the abstract.
Authors: We will expand the experimental results to include per-domain accuracy breakdowns for all evaluated models. Where multiple runs are feasible, we will report variance estimates and apply statistical significance tests. We will further add a failure-mode analysis that categorizes errors according to whether they arise from difficulties in cross-image integration versus other reasoning limitations. revision: yes
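A minimal version of the promised breakdown and variance estimate, assuming per-problem binary outcomes; the percentile bootstrap and resampling count are illustrative choices, not the authors' method.

```python
# Per-domain accuracy with percentile-bootstrap confidence intervals; the
# resampling count and alpha are illustrative assumptions.
import random
from collections import defaultdict

def per_domain_accuracy(results: list[tuple[str, int]]) -> dict[str, float]:
    """results holds (domain, correct) pairs with correct in {0, 1}."""
    by_domain = defaultdict(list)
    for domain, correct in results:
        by_domain[domain].append(correct)
    return {d: sum(v) / len(v) for d, v in by_domain.items()}

def bootstrap_ci(correct_flags: list[int], n_boot: int = 10_000, alpha: float = 0.05):
    """Resample per-problem outcomes with replacement; return (lo, hi) bounds."""
    n = len(correct_flags)
    means = sorted(sum(random.choices(correct_flags, k=n)) / n for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```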
Circularity Check
No circularity: benchmark construction is empirical and self-contained
Full rationale
The paper creates OMIBench by curating Olympiad problems from biology, chemistry, mathematics, and physics, supplying manually annotated rationales, and reporting direct empirical accuracy of LVLMs (e.g., Gemini-3-Pro at ~50%). There are no equations, fitted parameters, predictions, or derivations that could reduce to their own inputs by construction, and no load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claim is an observed performance gap on the new dataset, measured externally rather than derived from prior fitted quantities or self-referential definitions. This is the expected non-circular outcome for a benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Olympiad problems frequently require integrating information distributed across multiple images.
invented entities (1)
- OMIBench (no independent evidence)