pith. machine review for the scientific record.

arXiv:2512.08980 · v3 · submitted 2025-12-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 00:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · reinforcement learning · tool use · multi-image reasoning · visual agents · trajectory masking · end-to-end training · visual reflection tools

The pith

IMAgent learns effective tool use for multi-image reasoning through pure end-to-end reinforcement learning without supervised fine-tuning data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IMAgent, an open-source visual agent for fine-grained reasoning over single or multiple images. It counters the tendency of vision-language models to neglect visual content during long reasoning by adding two tools for visual reflection and verification. A two-layer motion trajectory masking strategy plus a tool-use reward gain lets the base model acquire a useful tool-use policy solely through reinforcement learning. The authors also construct a new visually rich multi-image QA dataset to fill training-data gaps. This yields state-of-the-art results on standard benchmarks while showing how tool use sustains attention to images.
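For intuition, here is a minimal sketch of the kind of inference loop such a tool-using agent runs. The tool-call syntax, the `model.generate` API, and the tool names are all invented for illustration; the paper's actual interface is not specified in the material above.

```python
import re

def run_agent(model, tools, question, images, max_turns=6):
    """Hypothetical tool-use loop for a vision agent (illustrative only).

    The model emits text; if the text contains a call such as
    <tool>zoom(0, 10, 20, 200, 160)</tool> (syntax invented here), the tool
    result is fed back as a new observation and generation continues.
    """
    context = [{"role": "user", "content": [question, *images]}]
    reply = ""
    for _ in range(max_turns):
        reply = model.generate(context)  # assumed chat-style API
        match = re.search(r"<tool>(\w+)\((.*?)\)</tool>", reply)
        if match is None:
            return reply  # no tool call: treat the reply as the final answer
        name, raw_args = match.group(1), match.group(2)
        observation = tools[name](raw_args, images)  # e.g., a cropped view
        context.append({"role": "assistant", "content": reply})
        context.append({"role": "tool", "content": observation})
    return reply
```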

Core claim

Equipped with two dedicated tools for visual reflection and verification, IMAgent trains a base VLM end-to-end via reinforcement learning. A two-layer motion trajectory masking strategy and tool-use reward gain produce an effective tool-use paradigm without any supervised fine-tuning data. The method reveals that tool usage enhances performance by maintaining attention on image content and reaches SOTA results on single- and multi-image benchmarks.
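The attention claim implies a concrete measurement. A minimal sketch of one way to quantify it, assuming access to decoder attention weights and a boolean mask over image-token positions (both names are hypothetical, not the paper's notation):

```python
import torch

def visual_attention_fraction(attn: torch.Tensor,
                              visual_mask: torch.Tensor) -> torch.Tensor:
    """Share of each generated token's attention mass on image tokens.

    attn:        (layers, heads, T_query, T_key) softmaxed attention weights
    visual_mask: (T_key,) bool, True where the key position is an image token

    Returns a (T_query,) tensor. If the "gradual neglect" the paper describes
    holds, this curve drifts downward over decoding steps for the base VLM
    and recovers after a reflection/verification tool call.
    """
    mass_on_visual = attn[..., visual_mask].sum(dim=-1)  # (layers, heads, T_q)
    return mass_on_visual.mean(dim=(0, 1))               # average layers/heads
```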

What carries the argument

A two-layer motion trajectory masking strategy and a tool-use reward gain, which together shape the reinforcement-learning signal so that the model develops and sustains tool-use behavior.
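Figure 4 names action-level and trajectory-level masks; a minimal sketch of how two such masks might gate a token-level policy-gradient loss. The exact formulation is not given in the material above, so treat every name and the loss shape as assumptions:

```python
import torch

def two_level_masked_loss(logprobs, advantages, action_mask, traj_mask):
    """Hypothetical two-level masking of a REINFORCE-style loss.

    logprobs:    (B, T) log-probabilities of sampled tokens
    advantages:  (B,)   per-trajectory advantage estimates
    action_mask: (B, T) 1 for model-generated tokens, 0 for tokens returned
                        by tools (tool outputs should carry no gradient)
    traj_mask:   (B,)   1 for trajectories kept for training, 0 for ones
                        dropped as unstable (e.g., malformed tool calls)
    """
    per_token = -logprobs * advantages.unsqueeze(-1)      # (B, T)
    per_token = per_token * action_mask                   # action-level mask
    tok_counts = action_mask.sum(dim=-1).clamp(min=1)
    per_traj = per_token.sum(dim=-1) / tok_counts         # (B,)
    kept = traj_mask.sum().clamp(min=1)
    return (per_traj * traj_mask).sum() / kept            # trajectory-level mask
```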

Load-bearing premise

The combination of visual reflection and verification tools with the specific masking and reward design will cause the base VLM to learn and maintain an effective tool-use policy through reinforcement learning alone.
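The ledger below flags the tool-use reward gain as an unspecified free parameter. As a sketch of where such a term could sit, here is one hedged guess at the reward shape; the 0.1 value and the gating on correctness are purely illustrative assumptions:

```python
def trajectory_reward(answer_correct: bool, used_tool: bool,
                      tool_gain: float = 0.1) -> float:
    """Hypothetical outcome reward plus a tool-use gain (values invented).

    The gain is added only on correct trajectories, so the policy is nudged
    toward tool calls that actually help rather than toward tool spam.
    """
    reward = 1.0 if answer_correct else 0.0
    if used_tool and answer_correct:
        reward += tool_gain
    return reward
```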

What would settle it

An ablation in which removing either the two-layer masking or the tool-use reward gain causes the model to stop using the reflection tools and erases any improvement over the base VLM on multi-image tasks.
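Concretely, the settling experiment is a small factorial ablation. A sketch of the variant grid, with the configuration keys and values invented here for illustration:

```python
# Hypothetical ablation grid: each variant disables exactly one component
# of the full training recipe described above.
ABLATIONS = {
    "full":           {"action_mask": True,  "traj_mask": True,  "tool_gain": 0.1},
    "no_action_mask": {"action_mask": False, "traj_mask": True,  "tool_gain": 0.1},
    "no_traj_mask":   {"action_mask": True,  "traj_mask": False, "tool_gain": 0.1},
    "no_tool_gain":   {"action_mask": True,  "traj_mask": True,  "tool_gain": 0.0},
}
# The premise fails if the masked-out variants match the full recipe, and is
# supported if tool use collapses in those runs.
```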

Figures

Figures reproduced from arXiv:2512.08980 by Chengqi Dong, Chuhuai Yue, Fenghe Tang, Guojun Yin, Hang He, Jiajun Chai, Rongge Mao, S Kevin Zhou, Xiaohan Wang, Zekun Xu.

Figure 1. Attention proportions of newly generated tokens to the input, compared between Qwen2.5-VL-7B (a, b) and IMAgent (c, d).
Figure 2. Overview of IMAgent; the model automatically chooses whether and how to use tools based on the problem at hand.
Figure 3. A three-stage data construction pipeline based on multi-agent systems.
Figure 4. Action-level and trajectory-level masks of the two-level mask strategy, designed to stabilize training with visual tools.
Figure 5. Attention maps of the model using the visual confirmation tool.
Figure 6. Attention distribution across model layers.
Figure 7. Comparison of tools with and without trajectory masks.
Figure 8. Demonstration of some typical tool-use strategies.
Original abstract

Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, yet most open-source methods restrict inputs to a single image, limiting their applicability to real-world multi-image QA tasks. To address this gap, we propose IMAgent, an open-source visual agent trained with end-to-end reinforcement learning for fine-grained single/multi-image reasoning. During inference, VLMs tend to gradually neglect visual inputs; to mitigate this issue, we design two dedicated tools for visual reflection and verification, enabling the model to actively refocus attention on image content. Beyond that, we, for the first time, reveal how tool usage enhances agent performance from an attention perspective. Equipped with a carefully designed two-layer motion trajectory masking strategy and tool-use reward gain, IMAgent acquires an effective tool-use paradigm through pure reinforcement learning, eliminating the need for costly supervised fine-tuning data. To further unleash the inherent tool-usage potential of the base VLM and fill data gaps, we construct a challenging, visually enriched multi-image QA dataset via multi-agent system. Extensive experiments validate that IMAgent achieves SOTA performance across mainstream single and multi-image benchmarks, and our in-depth analysis offers actionable insights for the community. Code and data will be released soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces IMAgent, a VLM-based agent for fine-grained single- and multi-image reasoning trained entirely via end-to-end reinforcement learning. It adds two dedicated tools (visual reflection and verification) to counteract gradual neglect of visual tokens, introduces a two-layer motion-trajectory masking strategy together with a tool-use reward gain, and claims that these components allow the base VLM to acquire an effective tool-use policy through pure RL without any supervised fine-tuning data. A multi-agent system is used to construct a visually enriched multi-image QA dataset, and the authors report SOTA results on mainstream single- and multi-image benchmarks while providing an attention-based analysis of why tool use improves performance.

Significance. If the performance claims and the causal contribution of the masking and reward components are substantiated by rigorous ablations and attention measurements, the work would constitute a meaningful step toward training multi-image VLM agents with minimal supervised data. The explicit attention-perspective analysis and the release of code and data would be additional strengths.

major comments (3)
  1. [§4] §4 (Experiments) and associated tables: The central SOTA claim and the assertion that the two-layer masking plus tool-use reward gain enable pure-RL acquisition of tool-use policy rest on performance numbers, error bars, and ablation tables that are not visible in the provided sections. Without these, it is impossible to verify whether the reported gains are robust or whether the masking and reward components are load-bearing.
  2. [§3.2] §3.2 (Method, attention analysis): The claim that tool usage enhances performance “from an attention perspective” and that the two-layer masking prevents gradual neglect of visual tokens requires quantitative attention-shift metrics or visualizations before and after the masking strategy; the current text supplies only qualitative description.
  3. [§3.1–3.3] §3.1–3.3 (Reward design and masking): The tool-use reward gain and the two-layer motion-trajectory masking are presented as key enablers of credit assignment across tool calls and image tokens, yet no ablation isolating each component (e.g., performance with/without masking, with/without reward gain) is shown; such ablations are necessary to support the “eliminating the need for costly supervised fine-tuning data” claim.
minor comments (2)
  1. [Abstract] The abstract states that code and data “will be released soon”; a concrete release plan or repository link should be added before publication.
  2. [§3.2] Notation for the two-layer masking (e.g., which layers correspond to which trajectory segments) should be defined explicitly in §3.2 with a small diagram or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested quantitative evidence, tables, and ablations.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: The central SOTA claim and the assertion that the two-layer masking plus tool-use reward gain enable pure-RL acquisition of tool-use policy rest on performance numbers, error bars, and ablation tables that are not visible in the provided sections. Without these, it is impossible to verify whether the reported gains are robust or whether the masking and reward components are load-bearing.

    Authors: We acknowledge the need for clear visibility of the supporting data. The full manuscript contains the SOTA results with error bars (from three independent runs) and ablation tables in Section 4. To improve accessibility, we have added a consolidated main-text table summarizing key metrics and component contributions, ensuring the robustness of the reported gains is directly verifiable. revision: yes

  2. Referee: [§3.2] §3.2 (Method, attention analysis): The claim that tool usage enhances performance “from an attention perspective” and that the two-layer masking prevents gradual neglect of visual tokens requires quantitative attention-shift metrics or visualizations before and after the masking strategy; the current text supplies only qualitative description.

    Authors: We agree that quantitative metrics strengthen the analysis. The revised manuscript now includes attention-shift metrics (average visual-token attention weight over reasoning steps) and before/after attention map visualizations. These additions quantify how the masking strategy counters visual neglect and how tool use alters attention distribution. revision: yes

  3. Referee: [§3.1–3.3] §3.1–3.3 (Reward design and masking): The tool-use reward gain and the two-layer motion-trajectory masking are presented as key enablers of credit assignment across tool calls and image tokens, yet no ablation isolating each component (e.g., performance with/without masking, with/without reward gain) is shown; such ablations are necessary to support the “eliminating the need for costly supervised fine-tuning data” claim.

    Authors: We concur that isolating each component is essential. We have added a dedicated ablation table in the revised manuscript comparing variants (no masking, no reward gain, and full model). The results demonstrate that both the two-layer masking and tool-use reward gain are necessary for successful pure-RL tool-use policy learning, directly supporting the claim regarding elimination of supervised fine-tuning data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training with independent dataset and rewards

Full rationale

The paper presents an empirical training procedure for a multi-image VLM agent using end-to-end RL, custom reflection/verification tools, two-layer motion trajectory masking, and a tool-use reward gain. No equations, derivations, or first-principles predictions are offered that reduce performance claims to fitted parameters or self-referential definitions. The dataset is constructed separately via a multi-agent system to address data gaps, and results are validated on external benchmarks. The central claim (that these components enable effective tool-use policy acquisition without SFT) rests on experimental outcomes rather than on any self-referential definition or fitted-input reduction, leaving no circular step.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that VLMs neglect visual inputs over long generations and on the design choices of custom tools, masking strategy, and reward shaping whose precise parameterization and interaction effects are not detailed in the abstract.

free parameters (1)
  • tool-use reward gain
Scaling factor in the RL reward for tool calls; its value and tuning procedure are not specified in the abstract.
axioms (1)
  • Domain assumption: VLMs tend to gradually neglect visual inputs during inference.
    Invoked to motivate the design of reflection and verification tools.
invented entities (1)
  • IMAgent (no independent evidence)
    purpose: end-to-end RL-trained multi-image visual agent
    The model resulting from the described training procedure.

pith-pipeline@v0.9.0 · 5556 in / 1393 out tokens · 49671 ms · 2026-05-17T00:54:54.901263+00:00 · methodology

