InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

HanMing Deng; Jie Yang; Jingcheng Ni; Jixuan Ying; Lewei Lu; Shengnan Ma; Tao Hu; Wei Liu; Wenwen Tong; Xiangli Kong

arxiv: 2605.26520 · v1 · pith:UGZKVVX6new · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

Pith reviewed 2026-06-29 17:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords interleaved reasoning · visual-textual chain-of-thought · self-correcting visual sketches · stepwise reward · vision-language models · long-horizon visual reasoning

The pith

InterSketch interleaves self-correcting visual sketches with text reasoning and uses stepwise rewards for long-horizon visual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InterSketch to address shallow, text-centric reasoning in vision-language models on complex visual problems. It generates intermediate visual sketches with external tools and interleaves them with textual steps in a visual-textual chain-of-thought (VT-CoT) process, using reflection for self-correction. A cold-start phase trains on a synthesized interleaved dataset; reinforcement learning then applies stepwise rewards to counter reward sparsity in long sequences. The authors report gains on visual reasoning benchmarks, including results that outperform proprietary models such as Gemini-3-Pro.

Core claim

InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. It builds this via a synthesized high-quality interleaved VT-CoT dataset with reflection in a cold-start stage, then applies a stepwise reward mechanism in reinforcement learning to handle sparse end-only signals.
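The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_interleaved`, `propose`, `draw`, and `check` are hypothetical names, the external sketch tool is replaced by a stub that returns a string, and reflection is reduced to a boolean check on the tool output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    kind: str      # "text", "sketch", or "reflect"
    content: str


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def run_interleaved(
    question: str,
    propose: Callable[[str, List[Step]], Tuple[str, str]],
    draw: Callable[[str], str],
    check: Callable[[str], bool],
    max_turns: int = 8,
) -> Trajectory:
    """Alternate text reasoning and tool-drawn sketches until an answer appears.

    propose returns (action, payload), where action is "sketch" or "answer".
    draw stands in for the external tool; check stands in for reflection.
    All three are hypothetical placeholders for model and tool calls.
    """
    traj = Trajectory()
    for _ in range(max_turns):
        action, payload = propose(question, traj.steps)
        if action == "answer":
            traj.answer = payload
            break
        traj.steps.append(Step("text", f"plan: {payload}"))
        sketch = draw(payload)
        if not check(sketch):
            # Reflection: keep the rejected sketch in context so the next
            # proposal can revise it instead of building on it.
            traj.steps.append(Step("reflect", f"rejected: {sketch}"))
            continue
        traj.steps.append(Step("sketch", sketch))
    return traj


# Toy stand-ins: the first sketch is rejected, the revised one is accepted.
def toy_propose(question: str, steps: List[Step]) -> Tuple[str, str]:
    if any(s.kind == "sketch" for s in steps):
        return ("answer", "B")
    if any(s.kind == "reflect" for s in steps):
        return ("sketch", "mark corners")
    return ("sketch", "mark edges")


def toy_draw(instruction: str) -> str:
    return f"<img:{instruction}>"


def toy_check(sketch: str) -> bool:
    return "corners" in sketch
```

Calling `run_interleaved("How many edges?", toy_propose, toy_draw, toy_check)` with these stubs yields a text step, a reflection step, a second text step, an accepted sketch, and then the answer. In a real system the three callables would wrap model generations and tool invocations.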

What carries the argument

Interleaved visual-textual chain-of-thought with self-correcting sketches generated by external tools plus stepwise reward signals during RL training.

If this is right

Vision-language models gain the ability to maintain and revise visual perceptions across multiple reasoning turns rather than committing to a single text description.
Stepwise rewards allow credit assignment over extended sequences where only final task success is observable.
Self-correction via reflection reduces propagation of early perceptual errors into later logical steps.
The approach yields measurable gains on standard visual reasoning benchmarks that surpass certain closed-source models.
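The credit-assignment point can be made concrete with a toy calculation. The sketch below is illustrative only: the reward values are invented, `returns_to_go` is a hypothetical helper, and the source does not state how InterSketch actually scores individual steps. It computes the discounted return seen at each step of a failed four-step trajectory, once with an end-only reward and once with per-step rewards.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards at every step of one trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Hypothetical failed trajectory: the first two steps are correct, the third
# introduces an error, and the final answer is wrong.
end_only = [0.0, 0.0, 0.0, 0.0]   # only the (failed) outcome is scored
stepwise = [1.0, 1.0, 0.0, 0.0]   # each step is scored on its own

print(returns_to_go(end_only))    # every step looks equally bad
print(returns_to_go(stepwise))    # early correct steps still receive credit
```

Under the end-only signal every step of the failed trajectory receives the same zero return, so the learner cannot tell the correct early steps from the faulty third one. With per-step scores the first two steps keep a positive return.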

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar interleaving of generated intermediate representations could extend to sequential planning tasks that combine vision with action sequences.
Reducing dependence on external sketch tools by training an internal generator might lower latency while preserving the same reasoning structure.
The method suggests a path for hybrid systems that combine tool use with learned reasoning loops in other multimodal domains.

Load-bearing premise

A synthesized high-quality interleaved VT-CoT dataset together with a reflection mechanism suffices to bootstrap multi-turn interleaved reasoning and self-correction.

What would settle it

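Ablation runs that omit the synthesized VT-CoT dataset or the stepwise reward component. If those runs show no gains over text-only baselines on long-horizon visual reasoning benchmarks, the two components carry the result; if the gains persist without them, the premise above is not what drives the improvement.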

Figures

Figures reproduced from arXiv: 2605.26520 by HanMing Deng, Jie Yang, Jingcheng Ni, Jixuan Ying, Lewei Lu, Shengnan Ma, Tao Hu, Wei Liu, Wenwen Tong, Xiangli Kong, Yong Xien Chng, Yuanjie Zheng, Zehuan Wu, Zhiwei Ning, Ziyi Shang.

**Figure 2.** Cold-start data synthesis pipeline. The pipeline automatically generates high-quality …

**Figure 3.** Two-stage training framework of InterSketch. SFT employs cold-start supervised fine …

**Figure 4.** Comparison between the w/ and w/o stepwise …

**Figure 5.** Data statistics for the SFT and RL training stages.

**Figure 6.** Case study of InterSketch on Jigsaw Game.

**Figure 7.** Case study of InterSketch on Maze.

**Figure 8.** Case study of InterSketch on VSP.

**Figure 9.** Case study of InterSketch on Math.

**Figure 10.** Case study of InterSketch on Symbolic Reasoning.

**Figure 11.** Case study of InterSketch on Visual Search.

**Figure 12.** Case study of InterSketch on Visual Search.

**Figure 13.** Case study of InterSketch on Proportion VQA.

**Figure 14.** Case study of InterSketch on Math with reflection.

**Figure 15.** Case study of InterSketch on Math with reflection.

**Figure 16.** Failure case study of InterSketch on Math.

**Figure 17.** Failure case study of InterSketch on VSP.

Original abstract

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

InterSketch's idea of tool-generated self-correcting sketches plus stepwise rewards for VT-CoT is a reasonable response to text-centric limits, but the abstract leaves the cold-start bootstrap and results completely unverified.

The paper's main move is to insert dynamically generated visual sketches into VLM reasoning chains so that perception and logic can alternate over long tasks. It does this in two stages: a cold-start on a synthesized interleaved VT-CoT dataset that includes a reflection step for self-correction, followed by RL that supplies rewards at each step rather than only at the end.

The specific pairing of external-tool sketches, reflection during cold-start, and stepwise reward is not something already described in the cited work, so that combination counts as the concrete novelty. The motivation is also sound; many groups have noted that pure text CoT stays shallow on visual problems, and adding an explicit visual intermediate is a direct way to test whether that helps.

The soft spot is exactly where the stress-test note points. The abstract gives no procedure for building the synthesized dataset, no quality checks, no coverage of long-horizon examples, and no ablation that isolates what the cold-start actually contributes. Without those, it is impossible to know whether the base model even acquires the multi-turn interleaved behavior before RL begins. The results claim (outperforming Gemini-3-Pro) is stated without dataset sizes, baseline code, error bars, or statistical tests, so the performance numbers cannot be evaluated either.

This is the sort of paper that would interest people already working on visual chain-of-thought and tool-augmented VLMs. A reader looking for a concrete training recipe could pull useful pieces from the high-level design, but anyone needing reproducible evidence would find the current write-up insufficient.

I would send it to peer review because the underlying limitation it targets is real and the proposed structure is coherent on its own terms, but the referees would need to see the missing synthesis details, ablations, and full experimental controls before any stronger claim could be accepted.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces InterSketch, an interleaved reasoning model for vision-language models that enhances visual-textual chain-of-thought (VT-CoT) via self-correcting visual sketches generated dynamically with external tools and a stepwise reward in RL. It consists of a cold-start stage using a synthesized high-quality interleaved VT-CoT dataset plus reflection mechanism to bootstrap multi-turn reasoning and self-correction, followed by an RL stage with stepwise reward to address sparse end-only supervision in long-horizon tasks. The paper claims this enables effective perception and logical reasoning on complex visual challenges and reports outperformance over proprietary models such as Gemini-3-Pro on visual reasoning benchmarks.

Significance. If the central claims hold after proper validation, the work could advance VLMs beyond text-centric reasoning toward more human-like interleaved VT-CoT, with the external-tool sketch generation and stepwise reward addressing key limitations in long-horizon visual tasks. The two-stage training paradigm is a clear contribution if the bootstrap and reward design are shown to be effective via ablations.

major comments (2)

[Abstract] The claim of outperforming Gemini-3-Pro supplies no information on dataset sizes, baseline implementations, statistical tests, error bars, or exclusion criteria, rendering it impossible to judge whether the reported results support the central claim.

[Abstract, cold-start stage description] The sufficiency of the synthesized high-quality interleaved VT-CoT dataset together with the reflection mechanism for bootstrapping multi-turn interleaved reasoning and self-correction is asserted without any description of the synthesis procedure, quality filters, coverage of long-horizon cases, or ablations isolating the cold-start contribution; this is load-bearing for the subsequent RL gains and overall outperformance claim.
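One way to supply the error bars requested here is a paired bootstrap over per-item correctness. The sketch below is a generic illustration, not a procedure taken from the paper: `paired_bootstrap_ci` is a hypothetical helper, the inputs are 0/1 correctness flags for two systems on the same benchmark items, and the percentile interval is one common choice among several.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    a: List[int],
    b: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy difference (a minus b) with a percentile bootstrap interval.

    a and b hold 0/1 correctness flags for the same items, in the same order.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / n
    stats = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo_idx = min(n_resamples - 1, max(0, int((alpha / 2) * n_resamples)))
    hi_idx = min(n_resamples - 1, max(0, int((1 - alpha / 2) * n_resamples) - 1))
    return observed, stats[lo_idx], stats[hi_idx]
```

If the resulting interval excludes zero, the observed gap is unlikely to be an artifact of which benchmark items happened to be sampled. Resampling pairs rather than each system independently keeps item difficulty matched between the two systems.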

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments on the abstract below and will revise the abstract to improve the presentation of our claims and method.

Referee: [Abstract] The claim of outperforming Gemini-3-Pro supplies no information on dataset sizes, baseline implementations, statistical tests, error bars, or exclusion criteria, rendering it impossible to judge whether the reported results support the central claim.

Authors: We acknowledge that the abstract, due to space constraints, does not include these evaluation details. The full manuscript reports dataset sizes in Section 4.1, baseline implementations and training details in Section 4.2, and statistical tests with error bars plus exclusion criteria in Section 5 and Appendix B. We will revise the abstract to qualify the outperformance claim and explicitly reference the main text for the full evaluation protocol.

Revision: yes

Referee: [Abstract, cold-start stage description] The sufficiency of the synthesized high-quality interleaved VT-CoT dataset together with the reflection mechanism for bootstrapping multi-turn interleaved reasoning and self-correction is asserted without any description of the synthesis procedure, quality filters, coverage of long-horizon cases, or ablations isolating the cold-start contribution; this is load-bearing for the subsequent RL gains and overall outperformance claim.

Authors: The synthesis procedure, quality filters, long-horizon coverage, and reflection mechanism are described in Section 3.1, while ablations isolating the cold-start stage appear in Section 5.3. We agree the abstract would benefit from a concise summary of these elements. We will revise the abstract to briefly describe the synthesis approach and note the supporting ablations.

Revision: yes

Circularity Check

0 steps flagged

No circularity: method uses external tools and independently synthesized dataset

The paper presents a two-stage procedure (cold-start on a synthesized interleaved VT-CoT dataset plus reflection, followed by RL with stepwise reward) that relies on external sketch-generation tools and a separately constructed training set. No equations, fitted parameters renamed as predictions, or self-referential definitions appear. The central claims rest on the empirical performance of the resulting model rather than any derivation that reduces to its own inputs by construction. No load-bearing self-citations or uniqueness theorems imported from the authors' prior work are invoked in the supplied text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unverified quality of a synthesized dataset and the effectiveness of an externally designed stepwise reward; both are introduced without independent evidence in the abstract.

free parameters (1)

stepwise reward formulation
The design of per-step rewards is a modeling choice whose specific weights or shaping functions are not stated and would need fitting or hand-tuning.
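To show where such a free parameter would enter, one generic way to combine per-step and final rewards is written out below. This form is an assumption made for illustration; the abstract does not give the paper's actual formulation. Here $r_t$ is a hypothetical score for step $t$ of a $T$-step trajectory $\tau$, $r_{\text{final}}$ is the end-of-task reward, and $\lambda \in [0, 1]$ is the mixing weight that would need fitting or hand-tuning.

```latex
R(\tau) \;=\; \lambda \cdot \frac{1}{T} \sum_{t=1}^{T} r_t \;+\; (1 - \lambda)\, r_{\text{final}}
```

Setting $\lambda = 0$ recovers end-only supervision, and any $\lambda > 0$ injects intermediate signal, so results would need to be reported together with the value chosen.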

axioms (1)

Domain assumption: a synthesized interleaved VT-CoT dataset plus a reflection mechanism suffices to instill multi-turn self-correction.
Invoked in the description of the cold-start stage as the basis for subsequent RL.

invented entities (1)

stepwise reward mechanism (no independent evidence)
purpose: Mitigate reward sparsity in long-horizon reasoning
New component introduced in the RL stage; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5785 in / 1472 out tokens · 57616 ms · 2026-06-29T17:49:50.482671+00:00 · methodology


[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[4] [4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025.

[7] Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. arXiv preprint arXiv:2510.27492, 2025.

[8] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.

[9] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

[10] Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025.

[11] Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, et al. RealUnify: Do unified models truly benefit from unification? A comprehensive benchmark. arXiv preprint arXiv:2509.24897, 2025.

[12] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025.

[13] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. DeepEyesV2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025.

[14] Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, and Tao Jin. Thinking with programming vision: Towards a unified view for thinking with images. arXiv preprint arXiv:2512.03746, 2025.

[15] Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. SenseNova-MARS: Empowering multimodal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330, 2025.

[16] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

[17] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

[18] Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, and Shiyu Chang. VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs. arXiv preprint arXiv:2407.01863, 2024.

[19] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

[20] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[21] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.

[22] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.

[23] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.

[24] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.

[25] Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual perception with tools. arXiv preprint arXiv:2509.01656, 2025.

[26] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[27] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: A scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025.

[28] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In EuroSys, 2025.

[29] Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, and Kaipeng Zhang. TIR-Bench: A comprehensive benchmark for agentic thinking-with-images reasoning. arXiv preprint arXiv:2511.01833, 2025.

[30] Kai Zou, Ziqi Huang, Yuhao Dong, Shulin Tian, Dian Zheng, Hongbo Liu, Jingwen He, Bin Liu, Yu Qiao, and Ziwei Liu. Uni-MMMU: A massive multi-discipline multimodal unified benchmark. arXiv preprint arXiv:2510.13759, 2025.

[31] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.

[32] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024.

[33] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

[34] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[35] OpenAI. GPT-5. https://openai.com/gpt-5, 2025.

[36] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[37] Gemini. Gemini-3-pro. https://deepmind.google/models/gemini/pro/, 2025.

[38] Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. VTool-R1: VLMs learn to think with images via reinforcement learning on multimodal tool use. arXiv preprint arXiv:2505.19255, 2025.

[39] Chi Zhang, Haibo Qiu, Qiming Zhang, Zhixiong Zeng, Lin Ma, and Jing Zhang. DeepSketcher: Internalizing visual manipulation for multimodal reasoning. arXiv preprint arXiv:2509.25866, 2025.