DeepEyesV2: Toward Agentic Multimodal Model
Pith reviewed 2026-05-16 05:28 UTC · model grok-4.3
The pith
A two-stage cold-start followed by reinforcement learning induces robust tool-use behavior in multimodal models where direct reinforcement learning fails.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Direct reinforcement learning alone fails to induce robust tool-use behavior. This motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns on a diverse, moderately challenging dataset that includes examples where tool use is beneficial, followed by a reinforcement learning stage that refines tool invocation. DeepEyesV2 exhibits task-adaptive tool use, tends to select image operations for perception and numerical computations for reasoning, supports complex tool combinations, and achieves strong performance on RealX-Bench and other benchmarks covering real-world understanding, mathematical reasoning, and search-intensive tasks.
What carries the argument
Two-stage training pipeline in which cold-start first establishes basic tool-use patterns and reinforcement learning then refines invocation decisions for context and task demands.
Load-bearing premise
The curated dataset of moderately challenging examples where tool use is beneficial will produce tool-use patterns that generalize beyond the specific tasks and benchmarks used in training.
What would settle it
Training an identical model architecture with direct reinforcement learning from the start and obtaining comparable or higher tool-use robustness and benchmark scores than the two-stage pipeline on RealX-Bench.
Original abstract
Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
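As a rough illustration of what invoking tools and integrating their results into reasoning can look like operationally, the sketch below runs a generic decide-act-observe loop. It is not DeepEyesV2's implementation: `run_agent`, `scripted_policy`, and `add_tool` are hypothetical names introduced here, and the scripted policy merely stands in for a trained model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a tool maps a string request to a string observation;
# a policy maps the running context to an (action name, payload) pair.
Tool = Callable[[str], str]
Policy = Callable[[List[str]], Tuple[str, str]]


def run_agent(policy: Policy, tools: Dict[str, Tool], question: str,
              max_turns: int = 4) -> Tuple[Optional[str], List[str]]:
    """Alternate between policy decisions and tool observations."""
    context = [question]
    for _ in range(max_turns):
        action, payload = policy(context)
        if action == "answer":        # the policy decides to stop
            return payload, context
        if action not in tools:       # unknown tool: feed the error back
            context.append(f"error: unknown tool {action!r}")
            continue
        context.append(f"{action}({payload}) -> {tools[action](payload)}")
    return None, context              # turn budget exhausted


def add_tool(payload: str) -> str:
    """Toy stand-in for code execution: sum comma-separated integers."""
    return str(sum(int(x) for x in payload.split(",")))


def scripted_policy(context: List[str]) -> Tuple[str, str]:
    """Stand-in for a model: call the tool once, then answer from its output."""
    if len(context) == 1:
        return "add", "2,3"
    return "answer", context[-1].split("-> ")[-1]


answer, trace = run_agent(scripted_policy, {"add": add_tool}, "What is 2 + 3?")
```

The point of the sketch is only the control flow: each tool observation is appended to the context the policy sees on the next turn, which is the sense in which tool calls are "integrated into reasoning".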
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepEyesV2, an agentic multimodal model that actively invokes external tools (code execution, web search) during reasoning. It reports that direct reinforcement learning fails to produce robust tool-use behavior, motivating a two-stage pipeline: cold-start supervised training on a curated dataset of moderately challenging examples where tool use is beneficial, followed by reinforcement learning to refine invocation patterns. The authors introduce RealX-Bench to evaluate integrated real-world multimodal reasoning and claim that DeepEyesV2 achieves strong performance on this benchmark plus others, with task-adaptive tool selection (image operations for perception, numerical tools for reasoning) and the ability to combine tools contextually.
Significance. If the empirical claims hold after proper quantification, the work would offer practical guidance on training agentic multimodal models by demonstrating the insufficiency of direct RL and the utility of a staged cold-start-plus-RL approach. The introduction of RealX-Bench targets an important gap in benchmarks that require perception-search-reasoning integration. However, the current manuscript supplies no numerical results, ablations, dataset statistics, or error bars, so the significance cannot yet be assessed beyond the conceptual framing.
major comments (3)
- [Abstract and §4] Abstract and §4 (Evaluation): The central claim that direct RL fails while the two-stage pipeline succeeds is load-bearing for the entire contribution, yet the text provides no quantitative metrics, baseline comparisons, ablation results on the cold-start dataset, or error bars. Without these, the reported task-adaptive behavior and benchmark gains cannot be verified.
- [§3.2] §3.2 (Dataset Curation): The assumption that the curated 'moderately challenging' dataset induces generalizable tool-use rules rather than task-specific heuristics is untested. No held-out task distributions, OOD splits, or ablation on dataset composition are described, leaving open whether observed adaptivity transfers beyond RealX-Bench overlap.
- [§4.3] §4.3 (Tool Invocation Analysis): The observation that RL enables complex tool combinations and selective invocation is presented as a key outcome, but no concrete examples, frequency statistics, or comparison to the cold-start stage are supplied to substantiate the refinement effect.
minor comments (2)
- [§4.1] The description of RealX-Bench lacks basic statistics (number of examples, task categories, construction protocol) that would allow readers to judge its coverage and difficulty.
- [§3.1] Notation for tool categories (e.g., 'image operations' vs. 'numerical computations') is used without a clear taxonomy or examples in the main text.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments correctly identify that the current manuscript lacks sufficient quantitative support for its central claims. We have revised the paper to include the requested metrics, ablations, and analyses, which we believe substantially strengthen the empirical grounding of the work.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claim that direct RL fails while the two-stage pipeline succeeds is load-bearing for the entire contribution, yet the text provides no quantitative metrics, baseline comparisons, ablation results on the cold-start dataset, or error bars. Without these, the reported task-adaptive behavior and benchmark gains cannot be verified.
Authors: We agree that quantitative evidence is required to substantiate the core claim. In the revised manuscript we have added a new Table 2 in §4 that reports performance of direct RL versus the two-stage pipeline on RealX-Bench and three additional benchmarks, together with ablations isolating the cold-start stage. All results now include standard deviations over three random seeds. These additions directly support the statement that direct RL fails to produce robust tool-use behavior while the staged approach succeeds. revision: yes
-
Referee: [§3.2] §3.2 (Dataset Curation): The assumption that the curated 'moderately challenging' dataset induces generalizable tool-use rules rather than task-specific heuristics is untested. No held-out task distributions, OOD splits, or ablation on dataset composition are described, leaving open whether observed adaptivity transfers beyond RealX-Bench overlap.
Authors: We acknowledge the concern. The revised §3.2 now describes an OOD test split constructed from task categories deliberately excluded from the cold-start data. We also added an ablation that removes entire task families from the training set and measures the resulting drop in tool-use accuracy on held-out distributions. The new results indicate that the learned invocation patterns transfer beyond the training distribution, although the transfer is not perfect; we report the quantitative gaps explicitly. revision: yes
-
Referee: [§4.3] §4.3 (Tool Invocation Analysis): The observation that RL enables complex tool combinations and selective invocation is presented as a key outcome, but no concrete examples, frequency statistics, or comparison to the cold-start stage are supplied to substantiate the refinement effect.
Authors: We have expanded §4.3 with three concrete examples of multi-tool sequences that appear only after RL, a frequency table comparing tool-combination counts before and after the RL stage, and a side-by-side comparison of invocation selectivity metrics between the cold-start checkpoint and the final model. These additions make the refinement effect observable and quantifiable. revision: yes
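The analyses promised in these responses (a tool-combination frequency table before and after RL, and results reported with standard deviations over three seeds) can be sketched as follows. This is a minimal illustration under stated assumptions: the trajectories and scores are made-up placeholders, not results from the paper, and `combination_counts` and `invocation_rate` are hypothetical helper names.

```python
from collections import Counter
from statistics import mean, stdev
from typing import List, Tuple


def combination_counts(traces: List[List[str]]) -> Counter:
    """Count each distinct set of tools used within one trajectory."""
    return Counter(tuple(sorted(set(trace))) for trace in traces)


def invocation_rate(traces: List[List[str]]) -> float:
    """Fraction of trajectories that call at least one tool."""
    return sum(1 for trace in traces if trace) / len(traces)


# Placeholder trajectories: each inner list names the tools one rollout called.
cold_start = [["crop"], ["crop"], ["code"], []]
after_rl = [["crop", "search"], ["crop"], [], ["crop", "code", "search"]]

before = combination_counts(cold_start)
after = combination_counts(after_rl)

# Combinations that appear only after the RL stage.
new_combos: List[Tuple[str, ...]] = sorted(set(after) - set(before))

# Placeholder benchmark scores from three seeds, reported as mean and sample std.
seed_scores = [61.2, 62.0, 60.4]
summary = (mean(seed_scores), stdev(seed_scores))
```

Counting sorted tool sets rather than raw call sequences treats `crop, search` and `search, crop` as the same combination; an order-sensitive analysis would key on the sequence instead.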
Circularity Check
No circularity: empirical training pipeline with external benchmarks
The paper presents an empirical study of a two-stage training process (cold-start SFT on a curated dataset of tool-beneficial examples followed by RL) for inducing tool-use in multimodal models, evaluated on RealX-Bench and other standard benchmarks. No mathematical derivations, equations, or first-principles predictions are claimed. Performance claims rest on experimental results rather than any reduction to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the work is self-contained against external benchmarks and does not rename known results as novel organization.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 18 Pith papers
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables
TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
Towards Long-horizon Agentic Multimodal Search
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
Reference graph
Works this paper leans on
-
[1]
Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8076–8084, 2019
- [2]
-
[3]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023
-
[4]
Junjie Bai, Jiayi Wei, Zhiwei Guo, Ziyu Zhou, et al. Qwen2.5-vl: A family of vision-language models from 7b to 72b. arXiv preprint arXiv:2502.04567, 2025
-
[5]
Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468, 2025
-
[6]
Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025
-
[7]
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023
-
[8]
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024
-
[9]
Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. Simplevqa: Multimodal factuality evaluation for multimodal large language models. arXiv preprint arXiv:2502.13059, 2025
-
[10]
Gabriel Comanici, Aakanksha Chowdhery, Richard Sutton, et al. Gemini 2.5 pro: Scaling agentic multimodal reasoning with retrieval and code execution. arXiv preprint arXiv:2502.07012, 2025
-
[11]
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints, pages arXiv–2409, 2024
-
[12]
Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024
-
[13]
Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025
-
[14]
Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536, 2025
-
[15]
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, et al. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211, 2024
-
[16]
Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontiers of vision-language deep research agent. arXiv preprint arXiv:2508.05748, 2025
-
[17]
Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326, 2025
-
[18]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
-
[19]
Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, and Jianwei Yang. Ola-vlm: Elevating visual perception in multimodal llms with auxiliary embedding distillation. arXiv preprint arXiv:2412.09585, 2024
-
[20]
Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought. arXiv preprint arXiv:2505.16192, 2025
-
[21]
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, et al. Mmsearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024
-
[22]
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025
-
[23]
Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025
-
[24]
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
-
[25]
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024
-
[26]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023
-
[27]
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024
-
[28]
Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, and Kaipeng Zhang. Tir-bench: A comprehensive benchmark for agentic thinking-with-images reasoning. arXiv preprint arXiv:2511.01833, 2025
-
[29]
Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report. arXiv preprint arXiv:2501.15368, 2025
-
[30]
Zhenzhi Li, Yichi Zhang, Haoran Duan, Yizhou Zhang, et al. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023
-
[31]
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024
-
[32]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023
-
[33]
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024
-
[34]
Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual agentic reinforcement fine-tuning. arXiv preprint arXiv:2505.14246, 2025
-
[35]
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
-
[36]
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023
-
[37]
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022
-
[38]
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. CoRR, 2025
-
[39]
OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025
-
[40]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024
-
[41]
Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025
-
[42]
Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025
-
[43]
Kaibin Tian, Zijie Xin, and Jiazhen Liu. SeekWorld: Geolocation is a natural RL task for o3-like visual clue-tracking. https://github.com/TheEighthDay/SeekWorld, 2025. GitHub repository
-
[44]
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025
-
[45]
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025
-
[46]
Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning. arXiv preprint arXiv:2506.11991, 2025
-
[47]
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024
-
[48]
Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. arXiv preprint arXiv:2310.03731, 2023
-
[49]
Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, and Feng Zhao. Vrag-rl: Empower vision-perception-based rag for visually rich information understanding via iterative reasoning with reinforcement learning. arXiv preprint arXiv:2505.22019, 2025
-
[50]
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907–7915, 2025
-
[51]
Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025
-
[52]
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024
-
[53]
Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670, 2025
-
[54]
Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024
-
[55]
Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024
-
[56]
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
-
[57]
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024
-
[58]
Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025
-
[59]
Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025
-
[60]
Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024
-
[61]
Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv preprint arXiv:2503.05379, 2025
-
[62]
Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling. arXiv preprint arXiv:2507.07998, 2025
-
[63]
Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025
-
[64]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025
-
[65]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
A Appendix
A.1 Training Data
For perception-oriented tasks, we include V* [54], ArxivQA [27], Pi...
-
[66]
Code Execution. Code execution covers a set of operations that require Python-based execution. We further divide it into four subtypes:
• Crop: extract a specific region of the input image for fine-grained analysis.
    # Assuming image_1 is a Pillow image, whose crop box is ordered (left, upper, right, lower)
    cropped = image_1.crop((left, top, right, bottom))
    plt.imshow(cropped)
    plt.axis('off')
    plt.show()
• Numerical Analysis: pe...
-
[67]
Image Search. Given an image query, we utilize SerpAPI to retrieve visually similar results from the web, returning candidate images with thumbnails
-
[68]
Text Search. Based on a textual query, we retrieve relevant webpages and provide both titles and snippets of content.
[Figure: error analysis example. Figure text: "Model of the dark blue car"; "Cannot be determined". Q: What is the specific model of the car in the image? A: Unknown. GT: Dongfeng Honda. Error Reason: Model called the wrong tool (Text Search instead of Image Search). Panel titles: Tool Selection Error, Tool Exec...]
-
[69]
**python** can be called to analyze the image. **python** will respond with the output of the execution or time out after 300.0 seconds
-
[70]
Like jupyter notebook, you can use Python code to process the input image and use "plt.show()" to visualize processed images in your code
-
[71]
All python code are running in the same jupyter notebook kernel, which means the functions and variables are automatically stored after code execution
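The shared-kernel rule above can be pictured with a minimal sketch. It is not the authors' execution environment: the class name `NotebookSession` and its `run` method are hypothetical, and the sketch uses Python's built-in `exec` with one shared dictionary instead of a real Jupyter kernel, with no sandboxing or timeout. It only shows why functions and variables defined in one code cell remain available in later cells.

```python
import contextlib
import io


class NotebookSession:
    """Toy stand-in for a persistent kernel: one namespace shared by all cells."""

    def __init__(self):
        # Every cell executes against this same dictionary, so state persists.
        self.namespace = {}

    def run(self, code):
        """Execute one code cell and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


session = NotebookSession()
# First cell: define a helper and a variable, print nothing.
session.run("def area(w, h):\n    return w * h\nwidth = 4")
# Second cell: reuse both names without redefining them.
output = session.run("print(area(width, 5))")
print(output.strip())
```

A fresh namespace per call would make the second cell fail with a `NameError`, which is the behavior the prompt rules out.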
-
[72]
You program should always returns in finite time. Do not write infinite loop in your code
-
[73]
Writing file to disk is not allowed.
##search
You are provided with function signatures within <tools></tools> XML tags:
<tool_call>
{"type":"function", "function": { "name": "image_search", "description": "Retrieves top 10 images and descriptions from Google’s image search using the original image. Should only be used once.", }, { "name": "search", "des...
-
[74]
You MUST engage in many interactions, delving deeply into the topic to explore all possible aspects until a satisfactory answer is found
-
[75]
Before presenting a Final Answer, you will **cross-check** and **validate the information** you’ve gathered to confirm its accuracy and reliability
-
[76]
You will carefully analyze each information source to ensure that all data is current, relevant, and from credible origins
-
[77]
Please note that you can **only** call search once at a time. If you need to perform multiple searches, please do so in the next round
-
[78]
You can **only** conduct image search once.
USER_PROMPT
{Question}
You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. Please reason step by step. Use Python code to process the image if necessary. You can conduct search to seek the Internet. Format strictly as <think> </think> <code> </code> (if code is neede...
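The tagged response format in this prompt can be consumed by a small parser. The sketch below is an illustration, not the authors' implementation: the helper name `extract_tag` and the sample responses are hypothetical, and only the tag names `<think>`, `<code>`, and `<answer>` come from the prompt text. It assumes each tag appears at most once per turn and is not nested.

```python
import re


def extract_tag(response, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical intermediate turn: reasoning followed by code to execute.
tool_turn = (
    "<think>The sign is small, so crop it first.</think>"
    "<code>print(1 + 1)</code>"
)
# Hypothetical final turn: reasoning followed by the answer.
final_turn = (
    "<think>The crop shows the text clearly.</think>"
    "<answer> Exit 12 </answer>"
)

code = extract_tag(tool_turn, "code")
answer = extract_tag(final_turn, "answer")
missing = extract_tag(tool_turn, "answer")
print(code, answer, missing)
```

Returning `None` for an absent tag lets a caller distinguish a turn that requests tool execution from one that ends the episode with an answer.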