pith. machine review for the scientific record.

arXiv:2512.08980 · v3 · submitted 2025-12-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 00:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · reinforcement learning · tool use · multi-image reasoning · visual agents · trajectory masking · end-to-end training · visual reflection tools

The pith

IMAgent learns effective tool use for multi-image reasoning through pure end-to-end reinforcement learning without supervised fine-tuning data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IMAgent, an open-source visual agent for fine-grained reasoning over single or multiple images. It counters the tendency of vision-language models to neglect visual content during long reasoning by adding two tools for visual reflection and verification. A two-layer motion trajectory masking strategy plus a tool-use reward gain lets the base model acquire a useful tool-use policy solely through reinforcement learning. The authors also construct a new visually rich multi-image QA dataset to fill training-data gaps. This yields state-of-the-art results on standard benchmarks while showing how tool use sustains attention to images.
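For intuition, here is a minimal sketch of the kind of inference loop such a tool-using agent runs. The tool-call syntax, the `model.generate` API, and the tool names are all invented for illustration; the paper's actual interface is not specified in the material above.

```python
import re

def run_agent(model, tools, question, images, max_turns=6):
    """Hypothetical tool-use loop for a vision agent (illustrative only).

    The model emits text; if the text contains a call such as
    <tool>zoom(0, 10, 20, 200, 160)</tool> (syntax invented here), the tool
    result is fed back as a new observation and generation continues.
    """
    context = [{"role": "user", "content": [question, *images]}]
    reply = ""
    for _ in range(max_turns):
        reply = model.generate(context)  # assumed chat-style API
        match = re.search(r"<tool>(\w+)\((.*?)\)</tool>", reply)
        if match is None:
            return reply  # no tool call: treat the reply as the final answer
        name, raw_args = match.group(1), match.group(2)
        observation = tools[name](raw_args, images)  # e.g., a cropped view
        context.append({"role": "assistant", "content": reply})
        context.append({"role": "tool", "content": observation})
    return reply
```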

Core claim

Equipped with two dedicated tools for visual reflection and verification, IMAgent trains a base VLM end-to-end via reinforcement learning. A two-layer motion trajectory masking strategy and tool-use reward gain produce an effective tool-use paradigm without any supervised fine-tuning data. The method reveals that tool usage enhances performance by maintaining attention on image content and reaches SOTA results on single- and multi-image benchmarks.
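The attention claim implies a concrete measurement. A minimal sketch of one way to quantify it, assuming access to decoder attention weights and a boolean mask over image-token positions (both names are hypothetical, not the paper's notation):

```python
import torch

def visual_attention_fraction(attn: torch.Tensor,
                              visual_mask: torch.Tensor) -> torch.Tensor:
    """Share of each generated token's attention mass on image tokens.

    attn:        (layers, heads, T_query, T_key) softmaxed attention weights
    visual_mask: (T_key,) bool, True where the key position is an image token

    Returns a (T_query,) tensor. If the "gradual neglect" the paper describes
    holds, this curve drifts downward over decoding steps for the base VLM
    and recovers after a reflection/verification tool call.
    """
    mass_on_visual = attn[..., visual_mask].sum(dim=-1)  # (layers, heads, T_q)
    return mass_on_visual.mean(dim=(0, 1))               # average layers/heads
```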

What carries the argument

A two-layer motion trajectory masking strategy and a tool-use reward gain, which together shape the reinforcement-learning signal so that the model develops and sustains tool-use behavior.
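Figure 4 names action-level and trajectory-level masks; a minimal sketch of how two such masks might gate a token-level policy-gradient loss. The exact formulation is not given in the material above, so treat every name and the loss shape as assumptions:

```python
import torch

def two_level_masked_loss(logprobs, advantages, action_mask, traj_mask):
    """Hypothetical two-level masking of a REINFORCE-style loss.

    logprobs:    (B, T) log-probabilities of sampled tokens
    advantages:  (B,)   per-trajectory advantage estimates
    action_mask: (B, T) 1 for model-generated tokens, 0 for tokens returned
                        by tools (tool outputs should carry no gradient)
    traj_mask:   (B,)   1 for trajectories kept for training, 0 for ones
                        dropped as unstable (e.g., malformed tool calls)
    """
    per_token = -logprobs * advantages.unsqueeze(-1)      # (B, T)
    per_token = per_token * action_mask                   # action-level mask
    tok_counts = action_mask.sum(dim=-1).clamp(min=1)
    per_traj = per_token.sum(dim=-1) / tok_counts         # (B,)
    kept = traj_mask.sum().clamp(min=1)
    return (per_traj * traj_mask).sum() / kept            # trajectory-level mask
```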

Load-bearing premise

The combination of visual reflection and verification tools with the specific masking and reward design will cause the base VLM to learn and maintain an effective tool-use policy through reinforcement learning alone.
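The ledger below flags the tool-use reward gain as an unspecified free parameter. As a sketch of where such a term could sit, here is one hedged guess at the reward shape; the 0.1 value and the gating on correctness are purely illustrative assumptions:

```python
def trajectory_reward(answer_correct: bool, used_tool: bool,
                      tool_gain: float = 0.1) -> float:
    """Hypothetical outcome reward plus a tool-use gain (values invented).

    The gain is added only on correct trajectories, so the policy is nudged
    toward tool calls that actually help rather than toward tool spam.
    """
    reward = 1.0 if answer_correct else 0.0
    if used_tool and answer_correct:
        reward += tool_gain
    return reward
```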

What would settle it

An ablation in which removing either the two-layer masking or the tool-use reward gain causes the model to stop using the reflection tools and erases any improvement over the base VLM on multi-image tasks.
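Concretely, the settling experiment is a small factorial ablation. A sketch of the variant grid, with the configuration keys and values invented here for illustration:

```python
# Hypothetical ablation grid: each variant disables exactly one component
# of the full training recipe described above.
ABLATIONS = {
    "full":           {"action_mask": True,  "traj_mask": True,  "tool_gain": 0.1},
    "no_action_mask": {"action_mask": False, "traj_mask": True,  "tool_gain": 0.1},
    "no_traj_mask":   {"action_mask": True,  "traj_mask": False, "tool_gain": 0.1},
    "no_tool_gain":   {"action_mask": True,  "traj_mask": True,  "tool_gain": 0.0},
}
# The premise fails if the masked-out variants match the full recipe, and is
# supported if tool use collapses in those runs.
```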

Figures

Figures reproduced from arXiv:2512.08980 by Chengqi Dong, Chuhuai Yue, Fenghe Tang, Guojun Yin, Hang He, Jiajun Chai, Rongge Mao, S Kevin Zhou, Xiaohan Wang, Zekun Xu.

Figure 1. Attention proportions of newly generated tokens to the input, compared between Qwen2.5-VL-7B (a, b) and IMAgent (c, d).
Figure 2. Overview of IMAgent; the model automatically chooses whether and how to use tools based on the problem at hand.
Figure 3. A three-stage data construction pipeline based on multi-agent systems.
Figure 4. Action-level and trajectory-level masks of the two-level mask strategy, designed to stabilize training with visual tools.
Figure 5. Attention maps of the model using the visual confirmation tool.
Figure 6. Attention distribution across model layers.
Figure 7. Comparison of tools with and without trajectory masks.
Figure 8. Demonstration of some typical tool-use strategies.
Original abstract

Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, yet most open-source methods restrict inputs to a single image, limiting their applicability to real-world multi-image QA tasks. To address this gap, we propose IMAgent, an open-source visual agent trained with end-to-end reinforcement learning for fine-grained single/multi-image reasoning. During inference, VLMs tend to gradually neglect visual inputs; to mitigate this issue, we design two dedicated tools for visual reflection and verification, enabling the model to actively refocus attention on image content. Beyond that, we, for the first time, reveal how tool usage enhances agent performance from an attention perspective. Equipped with a carefully designed two-layer motion trajectory masking strategy and tool-use reward gain, IMAgent acquires an effective tool-use paradigm through pure reinforcement learning, eliminating the need for costly supervised fine-tuning data. To further unleash the inherent tool-usage potential of the base VLM and fill data gaps, we construct a challenging, visually enriched multi-image QA dataset via multi-agent system. Extensive experiments validate that IMAgent achieves SOTA performance across mainstream single and multi-image benchmarks, and our in-depth analysis offers actionable insights for the community. Code and data will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces IMAgent, a VLM-based agent for fine-grained single- and multi-image reasoning trained entirely via end-to-end reinforcement learning. It adds two dedicated tools (visual reflection and verification) to counteract gradual neglect of visual tokens, introduces a two-layer motion-trajectory masking strategy together with a tool-use reward gain, and claims that these components allow the base VLM to acquire an effective tool-use policy through pure RL without any supervised fine-tuning data. A multi-agent system is used to construct a visually enriched multi-image QA dataset, and the authors report SOTA results on mainstream single- and multi-image benchmarks while providing an attention-based analysis of why tool use improves performance.

Significance. If the performance claims and the causal contribution of the masking and reward components are substantiated by rigorous ablations and attention measurements, the work would constitute a meaningful step toward training multi-image VLM agents with minimal supervised data. The explicit attention-perspective analysis and the release of code and data would be additional strengths.

major comments (3)
  1. [§4] §4 (Experiments) and associated tables: The central SOTA claim and the assertion that the two-layer masking plus tool-use reward gain enable pure-RL acquisition of tool-use policy rest on performance numbers, error bars, and ablation tables that are not visible in the provided sections. Without these, it is impossible to verify whether the reported gains are robust or whether the masking and reward components are load-bearing.
  2. [§3.2] §3.2 (Method, attention analysis): The claim that tool usage enhances performance “from an attention perspective” and that the two-layer masking prevents gradual neglect of visual tokens requires quantitative attention-shift metrics or visualizations before and after the masking strategy; the current text supplies only qualitative description.
  3. [§3.1–3.3] §3.1–3.3 (Reward design and masking): The tool-use reward gain and the two-layer motion-trajectory masking are presented as key enablers of credit assignment across tool calls and image tokens, yet no ablation isolating each component (e.g., performance with/without masking, with/without reward gain) is shown; such ablations are necessary to support the “eliminating the need for costly supervised fine-tuning data” claim.
minor comments (2)
  1. [Abstract] The abstract states that code and data “will be released soon”; a concrete release plan or repository link should be added before publication.
  2. [§3.2] Notation for the two-layer masking (e.g., which layers correspond to which trajectory segments) should be defined explicitly in §3.2 with a small diagram or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested quantitative evidence, tables, and ablations.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: The central SOTA claim and the assertion that the two-layer masking plus tool-use reward gain enable pure-RL acquisition of tool-use policy rest on performance numbers, error bars, and ablation tables that are not visible in the provided sections. Without these, it is impossible to verify whether the reported gains are robust or whether the masking and reward components are load-bearing.

    Authors: We acknowledge the need for clear visibility of the supporting data. The full manuscript contains the SOTA results with error bars (from three independent runs) and ablation tables in Section 4. To improve accessibility, we have added a consolidated main-text table summarizing key metrics and component contributions, ensuring the robustness of the reported gains is directly verifiable. revision: yes

  2. Referee: [§3.2] §3.2 (Method, attention analysis): The claim that tool usage enhances performance “from an attention perspective” and that the two-layer masking prevents gradual neglect of visual tokens requires quantitative attention-shift metrics or visualizations before and after the masking strategy; the current text supplies only qualitative description.

    Authors: We agree that quantitative metrics strengthen the analysis. The revised manuscript now includes attention-shift metrics (average visual-token attention weight over reasoning steps) and before/after attention map visualizations. These additions quantify how the masking strategy counters visual neglect and how tool use alters attention distribution. revision: yes

  3. Referee: [§3.1–3.3] §3.1–3.3 (Reward design and masking): The tool-use reward gain and the two-layer motion-trajectory masking are presented as key enablers of credit assignment across tool calls and image tokens, yet no ablation isolating each component (e.g., performance with/without masking, with/without reward gain) is shown; such ablations are necessary to support the “eliminating the need for costly supervised fine-tuning data” claim.

    Authors: We concur that isolating each component is essential. We have added a dedicated ablation table in the revised manuscript comparing variants (no masking, no reward gain, and full model). The results demonstrate that both the two-layer masking and tool-use reward gain are necessary for successful pure-RL tool-use policy learning, directly supporting the claim regarding elimination of supervised fine-tuning data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training with independent dataset and rewards

Full rationale

The paper presents an empirical training procedure for a multi-image VLM agent using end-to-end RL, custom reflection/verification tools, two-layer motion trajectory masking, and a tool-use reward gain. No equations, derivations, or first-principles predictions are offered that reduce performance claims to fitted parameters or self-referential definitions. The dataset is constructed separately via a multi-agent system to address data gaps, and results are validated on external benchmarks. The central claim (that these components enable effective tool-use policy acquisition without SFT) rests on experimental outcomes rather than on any self-referential definition or fitted-input reduction, leaving no circular step.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that VLMs neglect visual inputs over long generations and on the design choices of custom tools, masking strategy, and reward shaping whose precise parameterization and interaction effects are not detailed in the abstract.

free parameters (1)
  • tool-use reward gain
Scaling factor in the RL reward for tool calls; its value and tuning procedure are not specified in the abstract.
axioms (1)
  • Domain assumption: VLMs tend to gradually neglect visual inputs during inference.
    Invoked to motivate the design of reflection and verification tools.
invented entities (1)
  • IMAgent (no independent evidence)
    purpose: end-to-end RL-trained multi-image visual agent
    The model resulting from the described training procedure.

pith-pipeline@v0.9.0 · 5556 in / 1393 out tokens · 49671 ms · 2026-05-17T00:54:54.901263+00:00 · methodology

