pith. machine review for the scientific record. sign in

arxiv: 2604.06777 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords MAPOmultimodal chain-of-thoughtvisual reasoningreinforcement learningagentic policy optimizationreasoning-action gapMLLM tool useadvantage estimation
0
0 comments X

The pith

MAPO requires models to explicitly describe visual tool outputs in text and rewards alignment between those descriptions and actual observations to close the gap between reasoning and actions in multimodal agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often produce plausible text reasoning but fail in the visual actions they take when using tools, causing errors to build up over multiple turns. The paper proposes Multimodal Agentic Policy Optimization (MAPO) to fix this by making the model write out what it sees in the tool results and then using a reward that checks how well those words match the real visual content alongside the final task success. This approach is shown to lower the variance in training signals and deliver better results on visual reasoning tests. Readers should care because reliable visual tool use is key to making AI systems that can truly think step by step with images rather than just talk about them.

Core claim

By mandating explicit textual descriptions for visual content obtained via tool usage and employing a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward, MAPO bridges the reasoning-action gap in multimodal chain-of-thought, reduces gradient variance, and achieves superior performance on visual reasoning benchmarks.

What carries the argument

MAPO's advantage estimation mechanism, which integrates semantic alignment scores between textual descriptions of tool outputs and the actual visual observations into the standard task reward.

If this is right

  • Models will execute more accurate visual actions because mismatches between text and observation lower the advantage signal.
  • Training trajectories become more stable as the method inherently reduces the variance of policy gradients.
  • Performance gains appear across multiple visual reasoning benchmarks that involve multi-turn tool usage.
  • The method can be applied to any multimodal agent that interleaves reasoning text with visual tool calls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this alignment requirement to other tool types, such as code execution or web search, could similarly improve consistency in non-visual agents.
  • Future work might explore whether the explicit descriptions can be used for additional supervision signals beyond the advantage estimate.
  • Applying MAPO in real-world settings with noisy observations would test if the semantic alignment remains robust.

Load-bearing premise

The assumption that generating and aligning explicit textual descriptions of visual observations will reliably reduce the reasoning-action discrepancy without creating new failure modes in how models describe or interpret images.

What would settle it

Running the same training setup but measuring the frequency of visual action errors that still occur despite high task rewards; if MAPO does not reduce those errors compared to standard RL, the bridging claim would not hold.

Figures

Figures reproduced from arXiv: 2604.06777 by Jinlong Huang, Kaifu Zhang, Lijun Zhang, Qing-Guo Chen, Shiyin Lu, Tat-Seng Chua, Weihua Luo, Wenhao Yang, Xiaobo Xia, Yuanyu Wan, Yuchen Zhou, Yu Xia, Zhao Xu.

Figure 1
Figure 1. Figure 1: Left: Overview of the agentic “thinking with images” framework where the MLLM agent executes visual actions guided by textual reasoning. Right: Training curves of the semantic score demonstrating that MAPO signifi￾cantly outperforms other RL methods. Nowadays, driven by the remarkable efficacy of Reinforcement Learning (RL) in agentic reasoning, the field has converged on a standard training pipeline: cold… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of MAPO. Our method bridges the reasoning-action gap by using a CLIP model to measure the semantic alignment between self-generated labels and observations. These signals are integrated into the advantage estimation to enforce process consistency. This score provides a dense supervision signal, quantifying the alignment between the text description and the visual content. Furthermore, this verific… view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of Multimodal Reasoning Trajectories. We visualize the intermediate reasoning steps of MAPO (left) and GRPO (right) for the same query. While GRPO fails to execute the visual action implied by its text, MAPO successfully aligns the visual execution of MLLM agent with its textual reasoning, effectively bridging the reasoning-action gap. source models: GPT-5 [72] and Gemini 2.5 Pro [73]; (ii) o… view at source ↗
Figure 4
Figure 4. Figure 4: Reward Curves of MAPO and GRPO. Unlike GRPO, which degrades in later stages, MAPO maintains stable learning, validating its scalability for multimodal agents large-scale training . Scalability of Agentic Training. One of our motivation for MAPO is addressing the instability inherent in train￾ing multimodal agents with sparse rewards [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
read the original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to ``think with images'' by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Multimodal Agentic Policy Optimization (MAPO) for MLLMs performing visual tool use in multi-turn reasoning. It mandates explicit textual descriptions of tool-obtained visual content within MCoT trajectories and proposes a novel advantage estimator that multiplies or combines a semantic alignment score (between mandated descriptions and actual observations) with the task reward. Theoretical analysis is claimed to show this reduces gradient variance relative to standard outcome-based RL; experiments reportedly show superior performance on visual reasoning benchmarks.

Significance. If the empirical gains and variance-reduction guarantee hold after addressing implementation details, the work would be significant for agentic multimodal systems: it directly targets the common failure mode where fluent textual reasoning masks imprecise visual actions. The requirement for explicit descriptions plus the coupled advantage estimator is a concrete, testable mechanism that could generalize beyond the reported benchmarks. The presence of a theoretical justification is a strength worth preserving.

major comments (2)
  1. [§3.2] §3.2 (Advantage Estimation): the central variance-reduction claim rests on the semantic-alignment term in the advantage estimator. The manuscript must specify the exact alignment function (embedding model, VLM judge, or other), its training/fine-tuning status, and any temperature or threshold hyperparameters. Without this, it is impossible to verify that the term does not introduce systematic bias or domain-mismatch noise in complex scenes, which would invalidate the theoretical guarantee that the estimator reduces variance relative to pure outcome rewards.
  2. [§4] §4 (Theoretical Analysis): the proof that the coupled estimator reduces gradient variance must explicitly bound the contribution of the alignment score. If the alignment metric is itself learned or noisy, the derivation should show that the overall variance is still strictly lower than the baseline; otherwise the claim reduces to an empirical observation rather than a theoretical result.
minor comments (2)
  1. [Table 1] Table 1 and §5.1: report the exact semantic-alignment scores alongside task rewards for a few example trajectories so readers can see the magnitude of the coupling term.
  2. [§5.3] §5.3 (Ablations): an ablation that removes the mandatory description generation (while keeping the same reward) would isolate whether the explicit-description mandate itself contributes to the reported gains or whether the gains are carried entirely by the advantage estimator.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the valuable feedback. We respond to each major comment below and have updated the manuscript to address the concerns about implementation details and theoretical bounds.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Advantage Estimation): the central variance-reduction claim rests on the semantic-alignment term in the advantage estimator. The manuscript must specify the exact alignment function (embedding model, VLM judge, or other), its training/fine-tuning status, and any temperature or threshold hyperparameters. Without this, it is impossible to verify that the term does not introduce systematic bias or domain-mismatch noise in complex scenes, which would invalidate the theoretical guarantee that the estimator reduces variance relative to pure outcome rewards.

    Authors: We agree that these details are necessary for verifying the claims and ensuring no unaccounted bias. In the revised manuscript, we have added to §3.2 the exact specification of the alignment function as a pre-trained VLM used to compute semantic similarity between the textual description and the visual observation. The model is not fine-tuned on task-specific data, and we use a temperature of 1.0 with no threshold applied. We include a discussion on why this choice minimizes domain mismatch and provide empirical evidence from our experiments that the term enhances rather than degrades performance. revision: yes

  2. Referee: [§4] §4 (Theoretical Analysis): the proof that the coupled estimator reduces gradient variance must explicitly bound the contribution of the alignment score. If the alignment metric is itself learned or noisy, the derivation should show that the overall variance is still strictly lower than the baseline; otherwise the claim reduces to an empirical observation rather than a theoretical result.

    Authors: We appreciate the suggestion to make the theoretical analysis more rigorous. We have expanded §4 in the revised manuscript to explicitly bound the alignment score's contribution. We assume the alignment score is bounded in [0,1] with finite variance, which holds for standard similarity metrics. Under the condition that the alignment score is positively correlated with the task reward (ensured by the explicit description requirement), we prove that the variance of the coupled advantage estimator is strictly lower than that of the pure outcome-based estimator. The updated derivation accounts for potential noise in the alignment metric and preserves the theoretical result. revision: yes

Circularity Check

0 steps flagged

MAPO derivation is self-contained with no reduction to inputs by construction

full rationale

The paper introduces MAPO as a policy optimization method that mandates textual descriptions of tool outputs and couples semantic alignment with task reward in a novel advantage estimator, supported by theoretical findings on gradient variance reduction. No equations, derivations, or self-citations are presented in the abstract or described claims that reduce the advantage estimation or variance-reduction result to fitted parameters, prior self-work, or definitional equivalence. The central claim rests on the independent novelty of the coupling mechanism and its empirical validation on benchmarks, without any load-bearing step that collapses to the inputs by construction. This is the typical honest non-finding for a methods paper whose theory is stated as justification rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; full details unavailable. The approach rests on standard RL policy optimization assumptions plus the unstated premise that semantic alignment between generated descriptions and observations can be reliably quantified and used for advantage estimation.

axioms (1)
  • domain assumption Standard assumptions of reinforcement learning for policy optimization hold for multimodal agent trajectories.
    The method extends RL practices to MCoT without stating new proofs.

pith-pipeline@v0.9.0 · 5557 in / 1234 out tokens · 55636 ms · 2026-05-10T18:15:59.575164+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 49 canonical work pages · 25 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  2. [2]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  3. [3]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InAdvances in Neural Information Processing Systems, pages 139348–139379, 2024

  4. [4]

    Thinking with images.https://openai.com/index/thinking-with-images/, 2025

    OpenAI. Thinking with images.https://openai.com/index/thinking-with-images/, 2025

  5. [5]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025. 10 APREPRINT

  6. [6]

    Thyme: Think Beyond Images

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images.arXiv preprint arXiv:2508.11630, 2025

  7. [7]

    Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

  8. [8]

    Deep but reliable: Advancing multi-turn reasoning for thinking with images.arXiv preprint arXiv:2512.17306, 2025

    Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, and Lijun Zhang. Deep but reliable: Advancing multi-turn reasoning for thinking with images.arXiv preprint arXiv:2512.17306, 2025

  9. [9]

    Llavanext: Improved reasoning, ocr, and world knowledge.https://llava-vl.github.io/blog/ 2024-01-30-llava-next/, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge.https://llava-vl.github.io/blog/ 2024-01-30-llava-next/, 2024

  10. [10]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

  11. [11]

    Kwai keye-vl technical report.arXiv preprint arXiv:2507.01949, 2025

    Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai keye-vl technical report.arXiv preprint arXiv:2507.01949, 2025

  12. [12]

    Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2. 5 technical report.arXiv preprint arXiv:2508.11737, 2025

  13. [13]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, pages 8748–8763, 2021

  14. [14]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceedings of the 39th International Conference on Machine Learning, pages 12888–12900, 2022

  15. [15]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning, pages 19730–19742, 2023

  16. [16]

    Sigmoid loss for language image pre- training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre- training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

  17. [17]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015

  18. [18]

    Show, attend and tell: Neural image caption generation with visual attention

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. InProceedings of the 32nd International Conference on Machine Learning, pages 2048–2057, 2015

  19. [19]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikoł aj Bi´nk...

  20. [20]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023

  21. [21]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  22. [22]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Bap- tiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 11 APREPRINT

  23. [23]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  24. [24]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  25. [25]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, page 220101, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, page 220101, 2024

  26. [26]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  27. [27]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  28. [28]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  29. [29]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2...

  30. [30]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  31. [31]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  32. [32]

    Ovis: Structural embedding alignment for multimodal large language model, 2024.arXiv preprint arXiv:2405.20797, 2024

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

  33. [33]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  34. [34]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of- thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023

  35. [35]

    Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, and Yi R. Fung. MMBoundary: Ad- vancing MLLM knowledge boundary awareness through reasoning step confidence calibration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16427–16444, 2025

  36. [36]

    Satori-r1: Incentivizing multimodal reasoning with spatial grounding and verifiable rewards.arXiv preprint arXiv:2505.19094, 2025

    Chuming Shen, Wei Wei, Xiaoye Qu, and Yu Cheng. Satori-r1: Incentivizing multimodal reasoning with spatial grounding and verifiable rewards.arXiv preprint arXiv:2505.19094, 2025

  37. [37]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  38. [38]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, pages 68539–68551, 2023. 12 APREPRINT

  39. [39]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

  40. [40]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action.arXiv preprint arXiv:2303.11381, 2023

  41. [41]

    Visual programming: Compositional visual reasoning without train- ing

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without train- ing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953– 14962, 2023

  42. [42]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

  43. [43]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  44. [44]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  45. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  46. [46]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  47. [47]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  48. [48]

    Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  49. [49]

    Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

  50. [50]

    Webshaper: Agentically data synthesizing via information-seeking formalization.arXiv preprint arXiv:2507.15061, 2025

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, et al. Webshaper: Agentically data synthesizing via information-seeking formalization. arXiv preprint arXiv:2507.15061, 2025

  51. [51]

    arXiv preprint arXiv:2505.07773 , year=

    Xinji Mai, Haotian Xu, Zhong-Zhi Li, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang, et al. Agent rl scaling law: Agent rl with spontaneous code execution for mathematical problem solving.arXiv preprint arXiv:2505.07773, 2025

  52. [52]

    Sim- pletir: End-to-end reinforcement learning for multi-turn tool-integrated reasoning.arXiv preprint arXiv:2509.02479,

    Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, and Bo An. Simpletir: End-to- end reinforcement learning for multi-turn tool-integrated reasoning.arXiv preprint arXiv:2509.02479, 2025

  53. [53]

    Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents

    Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan Awadallah. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: A...

  54. [54]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn rein- forcement learning.arXiv preprint arXiv:2504.20073, 2025

  55. [55]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  56. [56]

    arXiv preprint arXiv:2510.24514 , year=

    Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, et al. Latent sketchpad: Sketching visual thoughts to elicit multimodal reasoning in mllms. arXiv preprint arXiv:2510.24514, 2025

  57. [57]

    Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025

    Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language.arXiv preprint arXiv:2511.21395, 2025. 13 APREPRINT

  58. [58]

    Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025a

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

  59. [59]

    Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

    Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

  60. [60]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  61. [61]

    Adaptive chain-of-focus reasoning via dynamic visual search and zooming for efficient vlms.arXiv preprint arXiv:2505.15436, 2025

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025

  62. [62]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel- space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

  63. [63]

    High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

    Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, and Ziwei Liu. High-resolution visual reasoning via multi-turn grounding-based reinforcement learning.arXiv preprint arXiv:2507.05920, 2025

  64. [64]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incen- tivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

  65. [65]

    Visionthink: Smart and efficient vision lan- guage model via reinforcement learning.arXiv preprint arXiv:2507.13348, 2025

    Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning.arXiv preprint arXiv:2507.13348, 2025

  66. [66]

    Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

    Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

  67. [67]

    Thinking with programming vision: Towards a unified view for thinking with images.arXiv preprint arXiv:2512.03746, 2025

    Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, and Tao Jin. Thinking with programming vision: Towards a unified view for thinking with images.arXiv preprint arXiv:2512.03746, 2025

  68. [68]

    Vacot: Rethinking visual data augmentation with vlms.arXiv preprint arXiv:2512.02361, 2025

    Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu, and Chun Yuan. Vacot: Rethinking visual data augmentation with vlms.arXiv preprint arXiv:2512.02361, 2025

  69. [69]

    arXiv preprint arXiv:2602.12916 , year=

    Haobin Li, Yutong Yang, Yijie Lin, Dai Xiang, Mouxing Yang, and Xi Peng. Reliable thinking with images. arXiv preprint arXiv:2602.12916, 2026

  70. [70]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InProceedings of the Twentieth European Conference on Computer Systems, page 1279–1297, 2025

  71. [71]

    Momentum-based variance reduction in non-convex sgd

    Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. InAdvances in Neural Information Processing Systems, volume 32, 2019

  72. [72]

    Gpt-5 system card.https://cdn.openai.com/gpt-5-system-card.pdf, 2025

    OpenAI. Gpt-5 system card.https://cdn.openai.com/gpt-5-system-card.pdf, 2025

  73. [73]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  74. [74]

    V?: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  75. [75]

    Divide, conquer and combine: a training-free framework for high-resolution image perception in multimodal large lan- guage models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: a training-free framework for high-resolution image perception in multimodal large lan- guage models. InProceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applicatio...

  76. [76]

    Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024. 14 APREPRINT Appendix of “Walk the Talk: Bridging the Reasoning-...

  77. [77]

    Variance due to Policy Sampling (First Term):The termE r[ˆg|τ] =∇θ logπ(τ)·E[r|τ]represents the expected gradient direction for a specific trajectory. Assuming the semantic rewardr sem is calibrated to align with the task objective (i.e., it correlates positively withr out), the expected gradients for both methods point in similar directions: E[rout|τ]∝E[...

  78. [78]

    Varr(ˆgout|τ) =∥∇ θ logπ(τ)∥ 2 ·Var(r out) =C τ ·σ 2 out (21) Varr(ˆgsem|τ) =∥∇ θ logπ(τ)∥ 2 ·Var(r sem) =C τ ·σ 2 sem (22) whereC τ =∥∇ θ logπ(τ)∥ 2

    Variance due to Reward Noise (Second Term):The second term,E τ[Varr(ˆg|τ)], represents the variance due to reward signal noise for afixedtrajectory. Varr(ˆgout|τ) =∥∇ θ logπ(τ)∥ 2 ·Var(r out) =C τ ·σ 2 out (21) Varr(ˆgsem|τ) =∥∇ θ logπ(τ)∥ 2 ·Var(r sem) =C τ ·σ 2 sem (22) whereC τ =∥∇ θ logπ(τ)∥ 2. Sinceσ 2 sem < σ 2 out, it follows that: Eτ[Varr(ˆgsem|τ)...

  79. [79]

    The objective functionJ(θ)isL-smooth

  80. [80]

    The gradient estimatorˆgis unbiased:E[ˆg] =∇J(θ)

Showing first 80 references.