Recognition: no theorem link
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
Pith reviewed 2026-05-10 18:15 UTC · model grok-4.3
The pith
MAPO requires models to explicitly describe visual tool outputs in text and rewards alignment between those descriptions and actual observations to close the gap between reasoning and actions in multimodal agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAPO mandates explicit textual descriptions of the visual content obtained via tool use and couples the semantic alignment between those descriptions and the actual observations with the task reward in a novel advantage estimator. In doing so, it bridges the reasoning-action gap in multimodal chain-of-thought, reduces gradient variance, and achieves superior performance on visual reasoning benchmarks.
What carries the argument
MAPO's advantage estimation mechanism, which integrates semantic alignment scores between textual descriptions of tool outputs and the actual visual observations into the standard task reward.
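The abstract does not pin down the functional form of this coupling. A minimal GRPO-style sketch, assuming a multiplicative coupling and group normalization (both illustrative choices, not the paper's confirmed estimator):

```python
import statistics

def coupled_advantages(task_rewards, alignment_scores, eps=1e-8):
    """Group-normalized advantages for one prompt's rollout group.

    Each trajectory's task reward is scaled by a semantic-alignment score
    (how well its mandated textual descriptions match the actual tool
    observations). Multiplicative coupling is an assumption for this
    sketch; the paper's exact estimator may combine the terms differently.
    """
    coupled = [r * a for r, a in zip(task_rewards, alignment_scores)]
    mean = statistics.mean(coupled)
    std = statistics.pstdev(coupled)
    # Normalize within the rollout group, as in GRPO-style estimators.
    return [(c - mean) / (std + eps) for c in coupled]
```

Under this form, a trajectory that reaches the right answer but describes its tool outputs inaccurately receives a smaller advantage than one where text and actions agree, which is exactly the pressure the pith describes.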
If this is right
- Models will execute more accurate visual actions because mismatches between text and observation lower the advantage signal.
- Training trajectories become more stable as the method inherently reduces the variance of policy gradients.
- Performance gains appear across multiple visual reasoning benchmarks that involve multi-turn tool usage.
- The method can be applied to any multimodal agent that interleaves reasoning text with visual tool calls.
Where Pith is reading between the lines
- Extending this alignment requirement to other tool types, such as code execution or web search, could similarly improve consistency in non-visual agents.
- Future work might explore whether the explicit descriptions can be used for additional supervision signals beyond the advantage estimate.
- Applying MAPO in real-world settings with noisy observations would test if the semantic alignment remains robust.
Load-bearing premise
The assumption that generating and aligning explicit textual descriptions of visual observations will reliably reduce the reasoning-action discrepancy without creating new failure modes in how models describe or interpret images.
What would settle it
Running the same training setup but measuring the frequency of visual action errors that still occur despite high task rewards; if MAPO does not reduce those errors compared to standard RL, the bridging claim would not hold.
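The proposed check reduces to a simple measurement over logged trajectories. A sketch, assuming each trajectory records its task reward and a per-tool-call correctness flag (the field names are hypothetical):

```python
def residual_action_error_rate(trajectories, reward_threshold=1.0):
    """Fraction of high-task-reward trajectories that still contain at
    least one incorrect visual action (e.g. a crop that misses the
    referenced region). If MAPO bridges the reasoning-action gap, this
    rate should drop relative to standard outcome-reward RL.
    """
    high = [t for t in trajectories if t["task_reward"] >= reward_threshold]
    if not high:
        return 0.0
    bad = sum(1 for t in high if not all(t["action_correct"]))
    return bad / len(high)
```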
read the original abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multimodal Agentic Policy Optimization (MAPO) for MLLMs performing visual tool use in multi-turn reasoning. It mandates explicit textual descriptions of tool-obtained visual content within MCoT trajectories and proposes a novel advantage estimator that multiplies or combines a semantic alignment score (between mandated descriptions and actual observations) with the task reward. Theoretical analysis is claimed to show this reduces gradient variance relative to standard outcome-based RL; experiments reportedly show superior performance on visual reasoning benchmarks.
Significance. If the empirical gains and variance-reduction guarantee hold after addressing implementation details, the work would be significant for agentic multimodal systems: it directly targets the common failure mode where fluent textual reasoning masks imprecise visual actions. The requirement for explicit descriptions plus the coupled advantage estimator is a concrete, testable mechanism that could generalize beyond the reported benchmarks. The presence of a theoretical justification is a strength worth preserving.
major comments (2)
- [§3.2] Advantage Estimation: the central variance-reduction claim rests on the semantic-alignment term in the advantage estimator. The manuscript must specify the exact alignment function (embedding model, VLM judge, or other), its training/fine-tuning status, and any temperature or threshold hyperparameters. Without these details, it is impossible to verify that the term does not introduce systematic bias or domain-mismatch noise in complex scenes, which would invalidate the theoretical guarantee that the estimator reduces variance relative to pure outcome rewards.
- [§4] Theoretical Analysis: the proof that the coupled estimator reduces gradient variance must explicitly bound the contribution of the alignment score. If the alignment metric is itself learned or noisy, the derivation should show that the overall variance is still strictly lower than the baseline; otherwise the claim reduces to an empirical observation rather than a theoretical result.
minor comments (2)
- [Table 1] Table 1 and §5.1: report the exact semantic-alignment scores alongside task rewards for a few example trajectories so readers can see the magnitude of the coupling term.
- [§5.3] Ablations: an ablation that removes the mandatory description generation (while keeping the same reward) would isolate whether the explicit-description mandate itself contributes to the reported gains or whether the gains are carried entirely by the advantage estimator.
Simulated Author's Rebuttal
We thank the referee for the valuable feedback. We respond to each major comment below and have updated the manuscript to address the concerns about implementation details and theoretical bounds.
read point-by-point responses
-
Referee: [§3.2] Advantage Estimation: the central variance-reduction claim rests on the semantic-alignment term in the advantage estimator. The manuscript must specify the exact alignment function (embedding model, VLM judge, or other), its training/fine-tuning status, and any temperature or threshold hyperparameters. Without these details, it is impossible to verify that the term does not introduce systematic bias or domain-mismatch noise in complex scenes, which would invalidate the theoretical guarantee that the estimator reduces variance relative to pure outcome rewards.
Authors: We agree that these details are necessary for verifying the claims and ensuring no unaccounted bias. In the revised manuscript, we have added to §3.2 the exact specification of the alignment function as a pre-trained VLM used to compute semantic similarity between the textual description and the visual observation. The model is not fine-tuned on task-specific data, and we use a temperature of 1.0 with no threshold applied. We include a discussion on why this choice minimizes domain mismatch and provide empirical evidence from our experiments that the term enhances rather than degrades performance. revision: yes
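As a concrete reading of this setup, the alignment term could be a similarity between an embedding of the description and an embedding of the observation, rescaled to [0, 1]. A sketch with the embedding step abstracted away (the rebuttal's pre-trained VLM would supply the vectors; the rescaling is one plausible choice, not a confirmed detail):

```python
import math

def alignment_score(desc_vec, obs_vec):
    """Cosine similarity between the embedding of the model's textual
    description and the embedding of the tool observation, mapped from
    [-1, 1] to [0, 1] so it can scale a reward.
    """
    dot = sum(d * o for d, o in zip(desc_vec, obs_vec))
    nd = math.sqrt(sum(d * d for d in desc_vec))
    no = math.sqrt(sum(o * o for o in obs_vec))
    if nd == 0 or no == 0:
        return 0.5  # similarity undefined for a zero vector: neutral score
    return 0.5 * (1.0 + dot / (nd * no))
```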
-
Referee: [§4] Theoretical Analysis: the proof that the coupled estimator reduces gradient variance must explicitly bound the contribution of the alignment score. If the alignment metric is itself learned or noisy, the derivation should show that the overall variance is still strictly lower than the baseline; otherwise the claim reduces to an empirical observation rather than a theoretical result.
Authors: We appreciate the suggestion to make the theoretical analysis more rigorous. We have expanded §4 in the revised manuscript to explicitly bound the alignment score's contribution. We assume the alignment score is bounded in [0,1] with finite variance, which holds for standard similarity metrics. Under the condition that the alignment score is positively correlated with the task reward (ensured by the explicit description requirement), we prove that the variance of the coupled advantage estimator is strictly lower than that of the pure outcome-based estimator. The updated derivation accounts for potential noise in the alignment metric and preserves the theoretical result. revision: yes
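The rebuttal's argument follows the shape of the law of total variance. A sketch of the decomposition it appeals to, with $r \in \{r_{\text{out}}, r_{\text{sem}}\}$ and estimator $\hat g = \nabla_\theta \log \pi_\theta(\tau)\, r$:

```latex
\operatorname{Var}(\hat g)
  = \underbrace{\operatorname{Var}_{\tau}\!\big(\mathbb{E}_{r}[\hat g \mid \tau]\big)}_{\text{policy sampling}}
  + \underbrace{\mathbb{E}_{\tau}\!\big[\operatorname{Var}_{r}(\hat g \mid \tau)\big]}_{\text{reward noise}},
\qquad
\operatorname{Var}_{r}(\hat g \mid \tau)
  = \big\lVert \nabla_\theta \log \pi_\theta(\tau) \big\rVert^{2}\,
    \operatorname{Var}(r \mid \tau).
```

If $\mathbb{E}[r_{\text{sem}} \mid \tau] \propto \mathbb{E}[r_{\text{out}} \mid \tau]$ (the positive-correlation condition the authors invoke), the first term is comparable across estimators, so a strict per-trajectory inequality $\operatorname{Var}(r_{\text{sem}} \mid \tau) < \operatorname{Var}(r_{\text{out}} \mid \tau)$ is what would deliver the claimed overall reduction.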
Circularity Check
The MAPO derivation is self-contained, with no load-bearing step that reduces to its inputs by construction
full rationale
The paper introduces MAPO as a policy optimization method that mandates textual descriptions of tool outputs and couples semantic alignment with task reward in a novel advantage estimator, supported by theoretical findings on gradient variance reduction. No equations, derivations, or self-citations are presented in the abstract or described claims that reduce the advantage estimation or variance-reduction result to fitted parameters, prior self-work, or definitional equivalence. The central claim rests on the independent novelty of the coupling mechanism and its empirical validation on benchmarks, without any load-bearing step that collapses to the inputs by construction. This is the typical honest non-finding for a methods paper whose theory is stated as justification rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard assumptions of reinforcement learning for policy optimization hold for multimodal agent trajectories.