pith. machine review for the scientific record.

arxiv: 2509.07969 · v1 · pith:YEM2XMUInew · submitted 2025-09-09 · 💻 cs.CV · cs.AI · cs.CL

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Pith reviewed 2026-05-18 01:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords visual search · multi-turn reasoning · reinforcement learning · tool-based interactions · large multimodal models · reasoning patterns · over-turn masking · visual probe dataset

The pith

Mini-o3 trains with a cap of six interaction turns yet produces longer reasoning chains at inference time, and its accuracy on visual search tasks rises as more turns are allowed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing limits on interaction length and reasoning variety in tool-using multimodal models can be overcome without training directly on long sequences. It does so by building a dataset of hard visual search problems, collecting initial trajectories that display varied patterns such as depth-first search and trial-and-error, and applying an over-turn masking rule during reinforcement learning. A sympathetic reader would care because this suggests open models can tackle exploratory visual problems that currently require many back-and-forth steps. If the approach holds, performance keeps rising as the model is allowed more turns at inference time rather than plateauing at the training limit.

Core claim

Mini-o3 executes deep multi-turn reasoning spanning tens of steps on visual search tasks. It achieves this with a Visual Probe Dataset of challenging problems, an iterative pipeline that yields cold-start trajectories containing diverse patterns including depth-first search, trial-and-error, and goal maintenance, and an over-turn masking strategy in reinforcement learning that avoids penalizing responses reaching the maximum turn count. Despite training under a six-turn upper bound, the resulting model generates longer trajectories at inference time and shows rising accuracy with additional turns.

What carries the argument

Over-turn masking strategy during reinforcement learning that prevents penalization of responses hitting the turn limit, allowing test-time trajectories to exceed the six-turn training bound.
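
One way to picture the rule, as a minimal sketch rather than the paper's actual implementation: in a group-relative policy-gradient setup, a rollout that hits the turn cap without answering gets zero weight in the update instead of being treated as a wrong answer. The function name `masked_advantages`, the group-normalized advantage, and the choice to keep masked rollouts in the baseline statistics are all assumptions made for illustration; the page above does not specify any of them.

```python
from statistics import mean, pstdev


def masked_advantages(rewards, hit_turn_cap, eps=1e-6):
    """Group-normalized advantages with over-turn rollouts masked out.

    rewards:      one scalar reward per rollout in the group
    hit_turn_cap: True where the rollout ran out of turns before answering
    Returns (advantages, n_counted). Masked rollouts get advantage 0.0,
    so they contribute no gradient, positive or negative.
    """
    if len(rewards) != len(hit_turn_cap):
        raise ValueError("rewards and hit_turn_cap must have the same length")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [
        0.0 if capped else (r - mu) / (sigma + eps)
        for r, capped in zip(rewards, hit_turn_cap)
    ]
    # Number of rollouts that still carry a learning signal (assumed to be
    # the denominator when averaging the loss).
    n_counted = sum(not capped for capped in hit_turn_cap)
    return advantages, n_counted


# Hypothetical group of four rollouts: one correct, one wrong,
# and two cut off at the turn cap.
adv, n = masked_advantages([1.0, 0.0, 0.0, 0.0], [False, False, True, True])
print(adv, n)
```

Without the mask, the two cut-off rollouts would receive the same negative advantage as the wrong answer, which would push the policy toward answering early. With the mask they are simply ignored.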

If this is right

  • Accuracy on visual search problems continues to rise as the number of allowed interaction turns increases at inference time.
  • The model produces varied reasoning patterns such as depth-first search and trial-and-error without explicit training on each pattern.
  • State-of-the-art results are reached on challenging visual search tasks that require extended exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The masking technique may serve as a general method to encourage longer reasoning horizons in other tool-use settings without having to train on those longer horizons.
  • The same data-collection loop could be repeated to target even deeper search behaviors on different visual or multimodal problems.
  • If the scaling holds, training compute can remain modest while inference budgets are adjusted per task difficulty.

Load-bearing premise

The iterative data collection pipeline yields cold-start trajectories whose diverse reasoning patterns transfer to longer chains without systematic bias introduced by the masking rule.

What would settle it

Measure accuracy while steadily increasing the allowed inference turns; the claim is falsified if accuracy stops rising or begins to fall after a modest number of additional turns.
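
The test above can be sketched as a small check over an accuracy-versus-turn-budget table. This is only an illustration: the function name `first_non_improving_budget`, the `tolerance` parameter, and the numbers in the usage lines are hypothetical placeholders, not results from the paper.

```python
def first_non_improving_budget(accuracy_by_turns, tolerance=0.0):
    """Return the first turn budget whose accuracy fails to beat the
    previous budget by more than `tolerance`, or None if accuracy
    rises at every step.

    accuracy_by_turns: dict mapping allowed inference turns -> accuracy
    """
    budgets = sorted(accuracy_by_turns)
    for prev, cur in zip(budgets, budgets[1:]):
        if accuracy_by_turns[cur] - accuracy_by_turns[prev] <= tolerance:
            return cur
    return None


# Placeholder curves for illustration only (not measured values).
still_rising = {4: 0.40, 8: 0.45, 16: 0.50, 32: 0.55}
plateaued = {4: 0.40, 8: 0.45, 16: 0.50, 32: 0.50}

print(first_non_improving_budget(still_rising))
print(first_non_improving_budget(plateaued))
```

In practice `tolerance` would be set from the run-to-run variance, so that noise between seeds is not mistaken for a plateau or a gain.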

Original abstract

Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Mini-o3 for scaling tool-based multi-turn reasoning in visual search tasks with large multimodal models. It constructs the Visual Probe Dataset of challenging problems, uses an iterative pipeline to collect cold-start trajectories exhibiting diverse patterns (depth-first search, trial-and-error, goal maintenance), and applies an over-turn masking strategy during RL training. The central claim is that a model trained with a hard cap of only six interaction turns produces trajectories that naturally extend to tens of turns at inference time, with accuracy continuing to improve as turn count grows, yielding SOTA results on difficult visual search problems.

Significance. If the scaling behavior is shown to arise from transferable reasoning patterns rather than an artifact of the masking strategy, the work would provide a practical open-source recipe for longer-horizon exploratory visual reasoning, addressing current limitations of monotonous patterns and short interaction limits in multimodal agents. The dataset construction and iterative collection pipeline are concrete contributions that could be reused, though the absence of reported quantitative metrics, baselines, and ablations in the abstract limits immediate assessment of impact.

major comments (3)
  1. [Abstract] Abstract: the central scaling claim ('accuracy improving as the number of turns increases' despite a training cap of six turns) is load-bearing for the contribution yet is stated without any referenced table, figure, or quantitative result (e.g., accuracy-vs-turns curve, error bars, or comparison to a hard-stop baseline).
  2. [Abstract] Abstract (over-turn masking strategy): the claim that masking enables test-time scalability without penalizing over-turn responses during training requires an ablation (training with vs. without the mask, or with a hard stop) to demonstrate that longer productive trajectories are due to learned reasoning patterns rather than the training hack; no such experiment is described.
  3. [Abstract] Abstract (iterative data collection pipeline): the assertion that cold-start trajectories exhibit genuinely diverse and effective patterns (DFS, trial-and-error, goal maintenance) that transfer to longer chains lacks any reported metric of trajectory diversity, termination statistics, or bias analysis from the masking procedure.
minor comments (1)
  1. [Abstract] Abstract: 'state-of-the-art performance' is asserted without naming the specific benchmarks, prior open-source baselines, or exact metrics used for comparison.
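
Major comment 3 asks for a trajectory-diversity metric without naming one. As one illustrative possibility (not something the paper or the referee specifies), the Shannon entropy of per-trajectory pattern labels gives a single number that is zero when every trajectory follows the same pattern and grows as patterns are used more evenly. The function name `pattern_entropy` and the label strings are hypothetical, and the sketch assumes some prior annotation step has already assigned one label to each trajectory.

```python
import math
from collections import Counter


def pattern_entropy(labels):
    """Shannon entropy (in bits) of reasoning-pattern labels.

    labels: one pattern label per trajectory, e.g. "dfs" or "trial_and_error"
    """
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # Sum p * log2(1/p) over observed labels, with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical label sets for illustration.
print(pattern_entropy(["dfs", "dfs", "dfs", "dfs"]))
print(pattern_entropy(["dfs", "trial_and_error", "dfs", "trial_and_error"]))
```

A single label per trajectory is a simplification; a trajectory that mixes patterns would need a multi-label variant.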

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments correctly identify areas where the abstract could more explicitly connect to the quantitative evidence and analyses in the main text. We address each point below and have revised the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central scaling claim ('accuracy improving as the number of turns increases' despite a training cap of six turns) is load-bearing for the contribution yet is stated without any referenced table, figure, or quantitative result (e.g., accuracy-vs-turns curve, error bars, or comparison to a hard-stop baseline).

    Authors: We agree that the abstract should reference the supporting quantitative results. The main text includes Figure 3, which plots accuracy versus number of inference turns (with error bars from multiple seeds) and compares against a hard-stop baseline. In the revised manuscript we have updated the abstract to cite this figure and briefly note the observed trend of continued accuracy gains beyond the six-turn training cap. revision: yes

  2. Referee: [Abstract] Abstract (over-turn masking strategy): the claim that masking enables test-time scalability without penalizing over-turn responses during training requires an ablation (training with vs. without the mask, or with a hard stop) to demonstrate that longer productive trajectories are due to learned reasoning patterns rather than the training hack; no such experiment is described.

    Authors: This is a fair criticism. While Section 3.3 motivates the over-turn masking strategy, the initial submission did not contain a direct ablation. We have added an ablation study to the revised version (new Table 4 and accompanying text in Section 4.3) that trains an otherwise identical model without the mask and compares resulting trajectory lengths and accuracies at inference. The results indicate that masking permits longer productive chains without introducing the artifacts a hard stop would produce. revision: yes

  3. Referee: [Abstract] Abstract (iterative data collection pipeline): the assertion that cold-start trajectories exhibit genuinely diverse and effective patterns (DFS, trial-and-error, goal maintenance) that transfer to longer chains lacks any reported metric of trajectory diversity, termination statistics, or bias analysis from the masking procedure.

    Authors: We appreciate the request for quantitative support. Section 3.2 describes the iterative collection pipeline and provides qualitative examples of the patterns. To address the gap, the revised manuscript now includes a table (new Table 2) reporting the distribution of reasoning patterns across collected trajectories, termination statistics, and a short bias analysis of the masking procedure. We have also added a reference to this table in the abstract. revision: yes

Circularity Check

0 steps flagged

Empirical training pipeline shows no definitional circularity

full rationale

The paper presents an empirical recipe consisting of dataset construction, iterative trajectory collection exhibiting patterns such as depth-first search and trial-and-error, and RL training with an over-turn masking heuristic. The central claim that accuracy improves with turn count beyond the training cap of six is reported as an observed inference-time behavior on held-out visual search tasks, not as a quantity algebraically or statistically forced by the training limit or masking rule. No equations, self-definitional normalizations, or load-bearing self-citations reduce the reported scaling or accuracy gains to the fitted inputs by construction; the results remain externally falsifiable against standard benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central scaling claim rests on the unverified quality of the constructed Visual Probe Dataset and the assumption that the masking strategy preserves learning signal for longer trajectories.

free parameters (1)
  • maximum interaction turns during training
    Upper bound of six turns chosen for training efficiency; directly affects what trajectories are collected and masked.
axioms (1)
  • domain assumption: The Visual Probe Dataset contains problems that elicit diverse exploratory reasoning patterns when solved by the base model.
    Invoked to justify the iterative data collection step; no external validation of dataset difficulty or pattern diversity is described.
invented entities (1)
  • over-turn masking strategy (no independent evidence)
    purpose: Prevents penalization of responses that hit the maximum turn limit during RL training.
    New training modification introduced to allow test-time scaling; no independent evidence of its effect outside the reported experiments.

pith-pipeline@v0.9.0 · 5791 in / 1277 out tokens · 43576 ms · 2026-05-18T01:13:09.828737+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

    eess.AS 2026-04 unverdicted novelty 7.0

    Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...

  2. VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

  3. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  4. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  5. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  6. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  7. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  8. CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

    cs.AI 2026-04 unverdicted novelty 6.0

    CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.

  9. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  10. Boosting Reasoning in Large Multimodal Models via Activation Replay

    cs.CV 2025-11 unverdicted novelty 6.0

    Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.

  11. CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

    cs.CV 2025-11 unverdicted novelty 6.0

    CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.

  12. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  13. Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

    cs.CV 2026-04 unverdicted novelty 5.0

    Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...

  14. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  15. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  16. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  17. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
