arxiv: 2605.26520 · v1 · pith:UGZKVVX6new · submitted 2026-05-26 · 💻 cs.CV · cs.AI

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

Pith reviewed 2026-06-29 17:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords interleaved reasoning · visual-textual chain-of-thought · self-correcting visual sketches · stepwise reward · vision-language models · long-horizon visual reasoning

The pith

InterSketch interleaves self-correcting visual sketches with text reasoning and uses stepwise rewards for long-horizon visual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InterSketch to address shallow, text-centric reasoning in vision-language models on complex visual problems. The model generates intermediate visual sketches with external tools and interleaves them with textual steps in a visual-textual chain-of-thought (VT-CoT), using reflection for self-correction. A cold-start phase trains on a synthesized interleaved dataset; a reinforcement learning phase then applies stepwise rewards to reduce reward sparsity over long sequences. The authors report that, on visual reasoning benchmarks, InterSketch outperforms even proprietary models such as Gemini-3-Pro.

Core claim

InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. It builds this via a synthesized high-quality interleaved VT-CoT dataset with reflection in a cold-start stage, then applies a stepwise reward mechanism in reinforcement learning to handle sparse end-only signals.

What carries the argument

Interleaved visual-textual chain-of-thought with self-correcting sketches generated by external tools plus stepwise reward signals during RL training.
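
A minimal sketch of how such an interleaved loop could be organized, assuming a policy that emits either a text step, a sketch request, or a final answer, and a checker that triggers a reflection step when a sketch is unusable. The names `run_interleaved`, `toy_policy`, `toy_sketch_tool`, and `sketch_is_usable` are hypothetical stand-ins, not the paper's interfaces; the scripted policy exists only to make the control flow runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    kind: str      # "text", "sketch", or "reflect"
    content: str


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def run_interleaved(
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    sketch_tool: Callable[[str], str],
    sketch_is_usable: Callable[[str], bool],
    question: str,
    max_turns: int = 8,
) -> Trajectory:
    """Alternate text and sketch steps until the policy answers or turns run out."""
    traj = Trajectory()
    for _ in range(max_turns):
        kind, content = policy(question, traj.steps)
        if kind == "answer":
            traj.answer = content
            break
        if kind == "sketch":
            sketch = sketch_tool(content)  # external tool call (assumed interface)
            if not sketch_is_usable(sketch):
                # Reflection: record the failure so the next turn can revise.
                traj.steps.append(Step("reflect", f"rejected sketch: {sketch}"))
                continue
            traj.steps.append(Step("sketch", sketch))
        else:
            traj.steps.append(Step("text", content))
    return traj


# Scripted stand-ins, purely illustrative.
def toy_policy(question: str, steps: List[Step]) -> Tuple[str, str]:
    if not steps:
        return "text", "I should annotate the image first."
    if any(s.kind == "sketch" for s in steps):
        return "answer", "toy answer"
    if steps[-1].kind == "reflect":
        return "sketch", "annotate carefully"
    return "sketch", "annotate"


def toy_sketch_tool(command: str) -> str:
    return "clean annotation" if "carefully" in command else "blurry"


def sketch_is_usable(sketch: str) -> bool:
    return sketch != "blurry"


traj = run_interleaved(toy_policy, toy_sketch_tool, sketch_is_usable, "toy question")
print([s.kind for s in traj.steps], traj.answer)
```

In this toy run the first sketch is rejected, a reflection step is logged, and the second sketch is accepted before the answer is emitted, which is the revise-rather-than-commit behavior the summary describes.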

If this is right

  • Vision-language models gain the ability to maintain and revise visual perceptions across multiple reasoning turns rather than committing to a single text description.
  • Stepwise rewards allow credit assignment over extended sequences where only final task success is observable.
  • Self-correction via reflection reduces propagation of early perceptual errors into later logical steps.
  • The approach is reported to outperform certain closed-source models, such as Gemini-3-Pro, on visual reasoning benchmarks.
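
To make the credit-assignment point concrete, here is a toy illustration of why end-only rewards can be uninformative when every rollout fails. It assumes a reward of the form "final correctness plus a weighted mean of per-step scores" and a group-mean baseline; both the form and the `step_weight` value are assumptions for illustration, since the paper's actual reward formulation is not given in the text above.

```python
from typing import List, Sequence


def total_reward(step_ok: Sequence[int], final_correct: bool,
                 step_weight: float = 0.0) -> float:
    """Final-answer reward plus an optional bonus for correct intermediate steps."""
    step_bonus = sum(step_ok) / len(step_ok) if step_ok else 0.0
    return float(final_correct) + step_weight * step_bonus


def centered(rewards: List[float]) -> List[float]:
    """Subtract the group mean, so identical rewards yield zero signal."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Two hypothetical rollouts that both end in a wrong answer.
rollouts = [
    ([1, 1, 1, 0], False),  # three good steps, then a mistake
    ([0, 0, 0, 0], False),  # wrong from the start
]

end_only = centered([total_reward(s, f) for s, f in rollouts])
stepwise = centered([total_reward(s, f, step_weight=0.5) for s, f in rollouts])

print(end_only)   # [0.0, 0.0]
print(stepwise)   # [0.1875, -0.1875]
```

With end-only supervision the two failed rollouts are indistinguishable, while the stepwise variant separates the mostly-correct attempt from the fully wrong one.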

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interleaving of generated intermediate representations could extend to sequential planning tasks that combine vision with action sequences.
  • Reducing dependence on external sketch tools by training an internal generator might lower latency while preserving the same reasoning structure.
  • The method suggests a path for hybrid systems that combine tool use with learned reasoning loops in other multimodal domains.

Load-bearing premise

A synthesized high-quality interleaved VT-CoT dataset together with a reflection mechanism suffices to bootstrap multi-turn interleaved reasoning and self-correction.

What would settle it

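Ablation runs that omit the synthesized VT-CoT dataset or the stepwise reward component. If those runs show no gains over text-only baselines on long-horizon visual reasoning benchmarks, the two components are carrying the result; if they match the full model, the central claim is weakened.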

Figures

Figures reproduced from arXiv: 2605.26520 by HanMing Deng, Jie Yang, Jingcheng Ni, Jixuan Ying, Lewei Lu, Shengnan Ma, Tao Hu, Wei Liu, Wenwen Tong, Xiangli Kong, Yong Xien Chng, Yuanjie Zheng, Zehuan Wu, Zhiwei Ning, Ziyi Shang.

Figure 1: Performance on the interleaved reasoning tasks. Our InterSketch consistently achieves …
Figure 2: Cold-start data synthesis pipeline. The pipeline automatically generates high-quality …
Figure 3: Two-stage training framework of InterSketch. SFT employs cold-start supervised fine …
Figure 4: Comparison between the w/ and w/o stepwise …
Figure 5: Data statistics for the SFT and RL training stages.
Figure 6: Case study of InterSketch on Jigsaw Game.
Figure 7: Case study of InterSketch on Maze.
Figure 8: Case study of InterSketch on VSP.
Figure 9: Case study of InterSketch on Math. The example asks "How many edges are there in this polygon?" with options A. 12, B. 13, C. 14, D. 15, E. 16. The model's reasoning trace begins: <think>To determine the number of edges in the polygon, I observe that each edge is formed by connecting one vertex to the next in sequence, continuing around the shape until the final vertex connects back to the starting point, closing the polygon. Since every vertex marks a corner where two edges meet, t…
Figure 10: Case study of InterSketch on Symbolic Reasoning.
Figure 11: Case study of InterSketch on Visual Search.
Figure 12: Case study of InterSketch on Visual Search.
Figure 13: Case study of InterSketch on Proportion VQA.
Figure 14: Case study of InterSketch on Math with reflection.
Figure 15: Case study of InterSketch on Math with reflection.
Figure 16: Failure case study of InterSketch on Math.
Figure 17: Failure case study of InterSketch on VSP.
Original abstract

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces InterSketch, an interleaved reasoning model for vision-language models that enhances visual-textual chain-of-thought (VT-CoT) via self-correcting visual sketches generated dynamically with external tools and a stepwise reward in RL. It consists of a cold-start stage using a synthesized high-quality interleaved VT-CoT dataset plus reflection mechanism to bootstrap multi-turn reasoning and self-correction, followed by an RL stage with stepwise reward to address sparse end-only supervision in long-horizon tasks. The paper claims this enables effective perception and logical reasoning on complex visual challenges and reports outperformance over proprietary models such as Gemini-3-Pro on visual reasoning benchmarks.

Significance. If the central claims hold after proper validation, the work could advance VLMs beyond text-centric reasoning toward more human-like interleaved VT-CoT, with the external-tool sketch generation and stepwise reward addressing key limitations in long-horizon visual tasks. The two-stage training paradigm is a clear contribution if the bootstrap and reward design are shown to be effective via ablations.

major comments (2)
  1. [Abstract] Abstract: the claim of outperforming Gemini-3-Pro supplies no information on dataset sizes, baseline implementations, statistical tests, error bars, or exclusion criteria, rendering it impossible to judge whether the reported results support the central claim.
  2. [Abstract] Cold-start stage description (Abstract): the sufficiency of the synthesized high-quality interleaved VT-CoT dataset together with the reflection mechanism for bootstrapping multi-turn interleaved reasoning and self-correction is asserted without any description of the synthesis procedure, quality filters, coverage of long-horizon cases, or ablations isolating the cold-start contribution; this is load-bearing for the subsequent RL gains and overall outperformance claim.
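
As an illustration of the kind of uncertainty estimate the first comment asks for, the following is a minimal paired-bootstrap sketch for the accuracy difference between two systems scored on the same benchmark items. The per-item correctness vectors are made-up placeholders, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper name; a percentile interval is only one of several reasonable choices.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(correct_a: List[int], correct_b: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on shared items."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Placeholder data: 100 items, system A correct on 70, system B on 60.
correct_a = [1] * 70 + [0] * 30
correct_b = [1] * 60 + [0] * 40

low, high = paired_bootstrap_ci(correct_a, correct_b)
print(f"accuracy difference 95% CI: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support a claim of outperformance on that benchmark; one that straddles zero would not, whatever the point estimate.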

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments on the abstract below and will revise the abstract to improve the presentation of our claims and method.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of outperforming Gemini-3-Pro supplies no information on dataset sizes, baseline implementations, statistical tests, error bars, or exclusion criteria, rendering it impossible to judge whether the reported results support the central claim.

    Authors: We acknowledge that the abstract, due to space constraints, does not include these evaluation details. The full manuscript reports dataset sizes in Section 4.1, baseline implementations and training details in Section 4.2, and statistical tests with error bars plus exclusion criteria in Section 5 and Appendix B. We will revise the abstract to qualify the outperformance claim and explicitly reference the main text for the full evaluation protocol.
    Revision: yes

  2. Referee: [Abstract] Cold-start stage description (Abstract): the sufficiency of the synthesized high-quality interleaved VT-CoT dataset together with the reflection mechanism for bootstrapping multi-turn interleaved reasoning and self-correction is asserted without any description of the synthesis procedure, quality filters, coverage of long-horizon cases, or ablations isolating the cold-start contribution; this is load-bearing for the subsequent RL gains and overall outperformance claim.

    Authors: The synthesis procedure, quality filters, long-horizon coverage, and reflection mechanism are described in Section 3.1, while ablations isolating the cold-start stage appear in Section 5.3. We agree the abstract would benefit from a concise summary of these elements. We will revise the abstract to briefly describe the synthesis approach and note the supporting ablations.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: method uses external tools and independently synthesized dataset

Full rationale

The paper presents a two-stage procedure (cold-start on a synthesized interleaved VT-CoT dataset plus reflection, followed by RL with stepwise reward) that relies on external sketch-generation tools and a separately constructed training set. No equations, fitted parameters renamed as predictions, or self-referential definitions appear. The central claims rest on the empirical performance of the resulting model rather than any derivation that reduces to its own inputs by construction. No load-bearing self-citations or uniqueness theorems imported from the authors' prior work are invoked in the supplied text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unverified quality of a synthesized dataset and the effectiveness of an externally designed stepwise reward; both are introduced without independent evidence in the abstract.

free parameters (1)
  • stepwise reward formulation
    The design of per-step rewards is a modeling choice whose specific weights or shaping functions are not stated and would need fitting or hand-tuning.
axioms (1)
  • domain assumption: Synthesized interleaved VT-CoT dataset plus reflection mechanism suffices to instill multi-turn self-correction
    Invoked in the description of the cold-start stage as the basis for subsequent RL.
invented entities (1)
  • stepwise reward mechanism (no independent evidence)
    purpose: Mitigate reward sparsity in long-horizon reasoning
    New component introduced in the RL stage; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5785 in / 1472 out tokens · 57616 ms · 2026-06-29T17:49:50.482671+00:00 · methodology
