Visual Reasoning through Tool-supervised Reinforcement Learning
Pith reviewed 2026-05-10 02:59 UTC · model grok-4.3
The pith
A two-stage reinforcement learning curriculum with direct tool supervision enables multimodal models to master simple visual tools before tackling complex reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that direct supervision on a small set of simple visual tools, combined with a two-stage curriculum that first optimizes tool-specific rewards alone and then optimizes accuracy rewards while permitting tool calls, yields efficient training and strong tool-use capability on complex visual reasoning tasks.
What carries the argument
The ToolsRL two-stage reinforcement learning curriculum: the first stage optimizes exclusively tool-specific rewards for actions such as zoom-in, rotate, flip, and draw point/line; the second stage adds task-accuracy rewards while allowing tool calls.
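The staged reward schedule can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the reward names, weights, and the `rollout` dictionary fields are all assumptions.

```python
# Hypothetical sketch of the two-stage ToolsRL reward schedule.
# Field names and weights are illustrative assumptions.

def tool_reward(rollout):
    """Stage-1 reward: did the model emit a well-formed, correct tool call?"""
    r = 0.0
    if rollout["tool_call_parsed"]:    # syntactically valid call, e.g. <zoom-in>
        r += 0.5
    if rollout["tool_args_correct"]:   # e.g. the zoom-in region is right
        r += 0.5
    return r

def accuracy_reward(rollout):
    """Stage-2 reward: is the final answer correct?"""
    return 1.0 if rollout["answer_correct"] else 0.0

def staged_reward(rollout, stage):
    # Stage 1: tool-specific rewards only.
    # Stage 2: accuracy reward, with tool calls permitted but no longer
    # directly rewarded.
    return tool_reward(rollout) if stage == 1 else accuracy_reward(rollout)

example = {"tool_call_parsed": True, "tool_args_correct": False,
           "answer_correct": True}
print(staged_reward(example, stage=1))  # 0.5
print(staged_reward(example, stage=2))  # 1.0
```

The point of the separation is visible in the example: the same rollout scores differently under the two stages, so folding both signals into one objective from the start would mix gradients from heterogeneous tasks.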
If this is right
- Tool calling capability is mastered independently before being applied to complete visual reasoning tasks.
- Potential optimization conflicts between learning tool use and achieving task accuracy are avoided.
- Training becomes more efficient while still reaching strong capabilities on complex visual reasoning.
- A small set of native interpretable tools proves adequate once the model has learned to call them reliably.
Where Pith is reading between the lines
- The staged supervision method could be tested on non-visual tools or other agent skills where intermediate capabilities conflict with final goals.
- Similar curricula might improve reliability when deploying models in interactive settings that require repeated tool use over time.
- Scaling the same separation to larger models or more varied visual tools would reveal whether the efficiency gains hold beyond the reported experiments.
Load-bearing premise
That separating tool mastery into a dedicated first stage with tool-specific rewards avoids optimization conflicts with task accuracy, and that supervision for the chosen simple visual tools is easy to collect and sufficient for complex reasoning.
What would settle it
A single-stage training run that jointly optimizes tool rewards and accuracy rewards achieving equal or better tool-use performance and task accuracy than the two-stage curriculum would falsify the necessity of the separation.
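The falsification test amounts to comparing two reward schedules under otherwise identical training. A minimal sketch, with the weight `alpha` and all function names as illustrative assumptions:

```python
# Sketch of the proposed control experiment: a single-stage joint objective
# versus the two-stage curriculum. `alpha` is an assumed mixing weight.

def joint_reward(tool_r, acc_r, alpha=0.5):
    """Single-stage baseline: one combined objective from step zero."""
    return alpha * tool_r + (1.0 - alpha) * acc_r

def curriculum_reward(tool_r, acc_r, step, stage_boundary):
    """Two-stage curriculum: tool reward first, accuracy reward afterward."""
    return tool_r if step < stage_boundary else acc_r

# Early in training, a rollout with a correct tool call but a wrong answer:
print(joint_reward(1.0, 0.0))                                     # 0.5
print(curriculum_reward(1.0, 0.0, step=10, stage_boundary=100))   # 1.0
```

If the joint baseline, swept over `alpha`, matches the curriculum on both tool-calling success and final accuracy, the staged separation is a convenience rather than a necessity.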
Original abstract
In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is solely optimized by a set of well motivated tool-specific rewards, and the second stage is trained with the accuracy targeted rewards while allowing calling tools. In this way, tool calling capability is mastered before using tools to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments have shown that the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ToolsRL, a two-stage reinforcement learning framework for multimodal large language models to master tool use in complex visual reasoning. It introduces simple visual tools (zoom-in, rotate, flip, draw point/line) with easily collectible supervision. Stage 1 optimizes solely on tool-specific rewards to master tool calling; Stage 2 adds task-accuracy rewards while permitting tool use. The authors claim this curriculum avoids optimization conflicts between heterogeneous objectives and that experiments demonstrate efficient training and strong tool-use capabilities.
Significance. If the empirical claims hold, the work could offer a practical curriculum-based approach to tool-augmented visual reasoning that separates tool proficiency from final-task optimization. The emphasis on native, interpretable tools with straightforward supervision is a constructive design choice that may generalize beyond the specific tools tested. However, without detailed quantitative results, baselines, or ablations, the magnitude of any advance over standard RL or prompting methods remains difficult to gauge.
major comments (2)
- [Methods / Experiments] The central design claim—that the two-stage curriculum avoids optimization conflicts between tool-specific rewards and accuracy rewards—lacks any supporting ablation or control experiment. No joint-optimization baseline, single-stage training run, or reward-weighting comparison is described, leaving the necessity and benefit of the separation as an untested modeling choice rather than a demonstrated requirement (see the curriculum description and experimental claims).
- [Abstract / Experiments] The abstract and experimental summary assert that 'the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities,' yet no quantitative metrics, error bars, baseline comparisons, or ablation tables are referenced in the provided text. This absence directly undermines verification of the efficiency and capability claims that constitute the paper's primary contribution.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta or sample efficiency) to ground the efficiency and capability assertions.
- [Methods] Notation for the tool-specific reward functions and the transition between stages could be made more explicit (e.g., by defining the reward components in an equation) to improve reproducibility.
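One possible formalization of the staged objective, in our own notation rather than the paper's: let $\tau$ be a rollout and $\mathcal{T}$ the tool set (zoom-in, rotate, flip, draw point/line).

```latex
% Assumed notation: r_tool^k rewards a correct call of tool k,
% r_acc rewards a correct final answer.
\begin{align}
  r^{(1)}(\tau) &= \sum_{k \in \mathcal{T}} r_{\text{tool}}^{k}(\tau)
    && \text{(Stage 1: tool-specific rewards only)} \\
  r^{(2)}(\tau) &= r_{\text{acc}}(\tau)
    && \text{(Stage 2: accuracy reward, tool calls permitted)}
\end{align}
```

Making the stage boundary and any reward weights explicit in this form would address the reproducibility concern directly.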
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and describe the revisions that will be incorporated into the manuscript.
Point-by-point responses
-
Referee: [Methods / Experiments] The central design claim—that the two-stage curriculum avoids optimization conflicts between tool-specific rewards and accuracy rewards—lacks any supporting ablation or control experiment. No joint-optimization baseline, single-stage training run, or reward-weighting comparison is described, leaving the necessity and benefit of the separation as an untested modeling choice rather than a demonstrated requirement (see the curriculum description and experimental claims).
Authors: We agree that the current manuscript does not contain explicit ablation studies comparing the two-stage curriculum to joint optimization or single-stage alternatives. The separation is motivated by the distinct objectives of the reward signals (tool invocation precision versus end-task accuracy), which can create optimization tension when combined from the start. To substantiate this, we will add a dedicated ablation subsection in the revised Experiments section that includes (i) a joint-optimization baseline trained with a combined reward, (ii) a single-stage accuracy-only run, and (iii) quantitative comparisons of tool-calling success rate and final task accuracy. These results will be presented with error bars and statistical significance tests. revision: yes
-
Referee: [Abstract / Experiments] The abstract and experimental summary assert that 'the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities,' yet no quantitative metrics, error bars, baseline comparisons, or ablation tables are referenced in the provided text. This absence directly undermines verification of the efficiency and capability claims that constitute the paper's primary contribution.
Authors: We acknowledge that the submitted abstract and summary statements do not reference specific numerical results. The full manuscript contains quantitative evaluations in Section 4, including tool-use accuracy, task success rates, training efficiency curves, and comparisons against prompting and standard RL baselines. In the revision we will update the abstract to cite the key metrics (e.g., relative improvements in tool-calling success and overall accuracy) and will add explicit cross-references to the corresponding tables and figures so that the claims are directly verifiable from the abstract. revision: yes
Circularity Check
No circularity: methodological proposal with no derivations or self-referential reductions
full rationale
The paper proposes a ToolsRL framework consisting of a two-stage curriculum (tool-specific rewards first, then accuracy rewards) and simple visual tools. No equations, derivations, or first-principles results are present in the abstract or described text. The curriculum separation is presented as a design choice motivated by avoiding optimization conflicts, not as a derived necessity that reduces to its own inputs. No self-citations, fitted parameters renamed as predictions, or ansatzes appear. Experimental claims rest on empirical results rather than tautological reasoning, making the derivation chain self-contained with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Tool supervision for zoom-in, rotate, flip, and draw operations is easy to collect
- domain assumption Separating tool mastery from task accuracy avoids optimization conflicts among heterogeneous objectives
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv, 2025.
- [2] Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback. arXiv preprint arXiv:2507.20766, 2025.
- [3] Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, and Yun Fu. CoT-referring: Improving referring expression tasks with grounded reasoning, 2025.
- [4] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024.
- [5] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models, 2025.
- [6] Hangzhan Jin, Sicheng Lv, Sifan Wu, and Mohammad Hamdaqa. RL is neither a panacea nor a mirage: Understanding supervised vs. reinforcement learning fine-tuning for LLMs. arXiv preprint arXiv:2508.16546, 2025.
- [7] Yoonsik Kim, Moonbin Yim, and Ka Yeon Song. TableVQA-Bench: A visual question answering benchmark on multiple table domains, 2024.
- [8] Sunil Kumar, Bowen Zhao, Leo Dirac, and Paulina Varshavskaya. Reinforcing VLMs to use tools for detailed visual reasoning under resource constraints. arXiv preprint arXiv:2506.14821, 2025.
- [9] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.
- [10] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer, 2024.
- [11] Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, Bangkok, Thailand, 2024. Association for Computational Linguistics.
- [12] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, 2022. Association for Computational Linguistics.
- [13] Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, and Shafiq Joty. ChartQAPro: A more diverse and challenging benchmark for chart question answering. In Findings of the Association for Computational Linguistics, 2025.
- [14]
- [15] Minesh Mathew, Viraj Bagal, Rubén Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2582–2591, 2022.
- [16] Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, and Lijuan Wang. Point-RFT: Improving multimodal reasoning with visually grounded reinforcement finetuning, 2025.
- [17] OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [18] OpenAI. OpenAI o3 system card. https://openai.com/o3, accessed 2024-12-20.
- [19]
- [20] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [21] Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. ZoomEye: Enhancing multimodal LLMs with human-like zooming capabilities through tree-based image exploration. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6613–6629, Suzhou, China, 2025. Association for Computational Linguistics.
- [22] Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
- [23] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, and Yu Cheng. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.
- [24] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning.
- [25] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025.
- [26] Ye Wang, Qianglong Chen, Zejun Li, Siyuan Wang, Shijie Guo, Zhirui Zhang, and Zhongyu Wei. Simple o3: Towards interleaved vision-language reasoning. arXiv preprint arXiv:2508.12109, 2025.
- [27] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. CoRR, abs/2406.18521, 2024.
- [28] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025.
- [29] Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. VTool-R1: VLMs learn to think with images via reinforcement learning on multimodal tool use. arXiv preprint arXiv:2505.19255, 2025.
- [30] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [31] Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in MLLM reasoning. arXiv preprint arXiv:2507.03019, 2025.
- [32] Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and Jiaxing Huang. R1-ShareVL: Incentivizing reasoning capability of multimodal large language models via Share-GRPO, 2025.
- [33] Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting. arXiv preprint arXiv:2508.11408, 2025.
- [34] Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Chain-of-Focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv preprint arXiv:2505.15436, 2025.
- [35] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. CoRR, abs/2505.14362, 2025.
- [36] Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual perception with tools. arXiv preprint arXiv:2509.01656, 2025.