pith. machine review for the scientific record.

arxiv: 2604.08545 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Guannan Zhang, Hongwei Xue, Jintao Tong, Kunyu Shi, Ruixuan Li, Shilin Yan, Xiaojun Tang, Yangyang Wang, Yixiong Zou

Pith reviewed 2026-05-10 18:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords agentic multimodal models · meta-cognitive tool use · HDPO · conditional advantage estimation · tool efficiency · reinforcement learning · Metis model · orthogonal optimization channels

The pith

By reframing tool efficiency as a conditional objective rather than a competing reward, multimodal agents can achieve both fewer tool calls and higher accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current agentic multimodal models overuse external tools even when visual input alone suffices, creating latency and noise that harm reasoning. Standard reinforcement-learning penalties for tool use create a dilemma: aggressive penalties block needed calls, while mild ones are lost in the variance of the accuracy reward. The paper introduces a framework that keeps accuracy maximization separate from efficiency enforcement, applying the latter only to trajectories that already solve the task correctly. This separation lets the agent first master correct answers and then develop judgment about when to invoke tools. The resulting model cuts tool use by orders of magnitude while improving reasoning performance.

Core claim

HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness across all trajectories, and an efficiency channel that applies conditional advantage estimation to enforce execution economy only within accurate trajectories, thereby inducing meta-cognitive behavior that reduces tool invocations while raising accuracy.
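The two channels described above can be sketched as group-relative advantage computations. This is an editorial reconstruction from the abstract, not the paper's code: the function name, the group-normalization form, and the `1e-8` stabilizer are all assumptions.

```python
import numpy as np

def hdpo_advantages(rewards_acc, tool_calls):
    """Hypothetical sketch of HDPO's two optimization channels.

    rewards_acc: 1.0 if a sampled trajectory answered correctly, else 0.0.
    tool_calls:  number of tool invocations in each trajectory.
    Returns per-trajectory advantages for each channel; the paper is
    described as combining them only at the final loss.
    """
    r = np.asarray(rewards_acc, dtype=float)
    t = np.asarray(tool_calls, dtype=float)

    # Accuracy channel: standard group-normalized advantage over ALL
    # trajectories.
    adv_acc = (r - r.mean()) / (r.std() + 1e-8)

    # Efficiency channel: advantage computed ONLY within the accurate
    # subset; inaccurate trajectories receive zero efficiency signal.
    adv_eff = np.zeros_like(t)
    mask = r == 1.0
    if mask.sum() > 1:
        t_ok = t[mask]
        # Fewer tool calls than the accurate-group mean -> positive advantage.
        adv_eff[mask] = -(t_ok - t_ok.mean()) / (t_ok.std() + 1e-8)
    return adv_acc, adv_eff
```

On this reading, an incorrect trajectory is never pushed toward fewer tool calls, which is what keeps efficiency from suppressing essential tool use.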

What carries the argument

HDPO, the framework that decouples accuracy maximization from efficiency enforcement via conditional advantage estimation applied exclusively on accurate trajectories.

If this is right

  • The agent follows a natural curriculum, achieving task correctness before refining tool economy.
  • Tool invocations decrease by orders of magnitude on the same tasks.
  • Reasoning accuracy rises at the same time, removing the usual efficiency-accuracy trade-off.
  • No scalar penalty is needed, eliminating the dilemma of reward variance swallowing the efficiency signal.
  • The agent learns to arbitrate between internal knowledge and external utilities without reflexive tool calls.
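The "reward variance swallowing the efficiency signal" point can be shown numerically. A toy example with illustrative numbers (not from the paper): under a coupled reward `accuracy - lam * tool_calls` with group-normalized advantages, a mild penalty barely separates a tool-heavy correct trajectory from a tool-free one.

```python
import numpy as np

# Illustrative numbers, not from the paper: four sampled trajectories,
# two correct (one using 5 tool calls, one using none), two incorrect.
acc = np.array([1.0, 1.0, 0.0, 0.0])
tools = np.array([5.0, 0.0, 0.0, 2.0])

lam = 0.01  # a mild scalar penalty on tool use
coupled = acc - lam * tools
adv = (coupled - coupled.mean()) / coupled.std()

# After normalization, the tool-heavy correct trajectory keeps nearly
# the same positive advantage as the tool-free one: the efficiency
# signal is subsumed by the variance of the accuracy reward.
print(adv.round(3))
```

Raising `lam` enough to separate the two correct trajectories would also start punishing trajectories whose tool calls were genuinely necessary, which is the dilemma the conditional formulation is meant to dissolve.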

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditional-channel approach could extend to other multi-objective agent settings where one goal must not interfere with another, such as speed versus completeness.
  • Deployment latency in interactive multimodal applications would drop substantially once tool calls are limited to genuinely necessary cases.
  • Testing the same separation on language-only agents could reveal whether meta-cognitive self-reliance emerges in non-visual domains.
  • If the curriculum effect holds, future work might deliberately stage training to first saturate accuracy before activating efficiency constraints.

Load-bearing premise

The two optimization channels remain independent in practice and conditional advantage estimation on accurate trajectories alone suffices to produce the desired meta-cognitive behavior without introducing instabilities or biases.

What would settle it

A side-by-side evaluation on tasks solvable from raw visual context alone would settle it: persistently high tool-usage rates, or a drop in accuracy once the efficiency channel is active, would disprove the claim.

Figures

Figures reproduced from arXiv: 2604.08545 by Guannan Zhang, Hongwei Xue, Jintao Tong, Kunyu Shi, Ruixuan Li, Shilin Yan, Xiaojun Tang, Yangyang Wang, Yixiong Zou.

Figure 1
Figure 1: Comparison of tool-use efficiency and task performance. Existing methods rely heavily on tool calls, reflecting limited efficiency awareness. In contrast, our method uses tools far more selectively while achieving the best overall performance, showing that strong accuracy and high efficiency can be attained simultaneously.
Figure 2
Figure 2: Comparison between coupled-reward optimization and HDPO. Existing methods entangle accuracy and efficiency into a single reward signal, while HDPO decouples them into separate branches and combines them only at the final loss, enabling more strategic tool use.
Figure 3
Figure 3: Overview of Metis. A strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning. Rather than invoking tools by default, Metis adaptively determines when tool interactions provide genuinely useful evidence, and otherwise reasons directly from the available context to obtain the final answer.
Figure 4
Figure 4: Direct reasoning from visual context. The query can be resolved through visual understanding and prior knowledge alone. Metis abstains from tool invocation and answers directly, exemplifying the meta-cognitive restraint instilled by HDPO.
Figure 5
Figure 5: Targeted code execution for fine-grained visual analysis. The question requires comparing curves in a specific subplot region that is difficult to resolve at the original image scale. Metis invokes code to crop and enlarge the relevant area, enabling precise identification of the curve behavior near the queried time step.
Figure 6
Figure 6: System prompt used for Metis. The prompt defines the available tools, their invocation formats, and decision guidelines, encouraging the model to answer directly whenever possible and to invoke external tools only when they provide genuinely useful information.
Figure 7
Figure 7: Direct reasoning from visual inspection. The on-screen text is clearly legible from the raw image. Metis correctly extracts the answer without invoking code execution or search tools, avoiding unnecessary computational overhead.
Figure 8
Figure 8: Strategic image search for visual identification. The artwork cannot be reliably identified from visual features alone. Metis invokes image search to match the visual content against external references, then retrieves the completion year from the search results.
Figure 9
Figure 9: Strategic text search for factual knowledge. While the monument is visually identifiable, the queried measurement (cella width) cannot be inferred from the image. Metis recognizes this epistemic gap and invokes text search to retrieve the precise factual information from external sources.
read the original abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that current agentic multimodal models exhibit a meta-cognitive deficit leading to excessive tool invocation even when internal knowledge suffices. It identifies an optimization dilemma in scalarized RL rewards for tool efficiency and proposes HDPO, which decouples accuracy maximization (on all trajectories) from efficiency enforcement (via conditional advantage estimation only on accurate trajectories). This is said to induce a cognitive curriculum and yield the Metis model, which reduces tool invocations by orders of magnitude while increasing reasoning accuracy.

Significance. If the central claims hold and the orthogonality of channels is empirically validated without hidden interactions or selection bias, the work would represent a meaningful methodological advance for training efficient agentic systems. The reframing of efficiency as a strictly conditional objective rather than a competing scalar term could influence RL design for tool-using agents more broadly, particularly in multimodal settings where latency and noise from unnecessary tool calls are costly.

major comments (2)
  1. [Abstract] The description of HDPO provides no equations, pseudocode, or derivation for the conditional advantage estimation procedure or the two orthogonal channels. Without these, it is impossible to verify whether the accuracy and efficiency signals remain decoupled in practice or whether they interact through the shared policy and value networks, which is load-bearing for the claim that scalarization is avoided.
  2. [Abstract] The manuscript states that 'extensive evaluations demonstrate' orders-of-magnitude reduction in tool use with simultaneous accuracy gains, yet supplies no accuracy rates on trajectories, difficulty stratification, advantage variance within the conditioned subset, or ablation results. This leaves the central empirical claim unverifiable and does not address the risk that conditioning efficiency solely on accurate trajectories could introduce selection bias on low-complexity queries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight important aspects of clarity in the HDPO description and the presentation of empirical results. We address each major comment below and indicate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The description of HDPO provides no equations, pseudocode, or derivation for the conditional advantage estimation procedure or the two orthogonal channels. Without these, it is impossible to verify whether the accuracy and efficiency signals remain decoupled in practice or whether they interact through the shared policy and value networks, which is load-bearing for the claim that scalarization is avoided.

    Authors: We agree that the abstract, due to length constraints, presents HDPO at a conceptual level without equations or pseudocode. The full manuscript (Section 3) defines the two channels explicitly: the accuracy channel computes standard advantages over all trajectories to maximize correctness, while the efficiency channel computes conditional advantages exclusively on accurate trajectories (i.e., advantage masked by an accuracy indicator function). This formulation avoids scalarization by design, as the efficiency term is never added to the accuracy reward and does not alter value estimates for inaccurate trajectories. Shared policy and value networks are used, but the objectives remain orthogonal because efficiency updates are gated and do not create a competing gradient with accuracy. To improve verifiability, we will revise the abstract to include a concise statement of the conditional advantage procedure and reference the key formulation from the main text. revision: yes

  2. Referee: [Abstract] The manuscript states that 'extensive evaluations demonstrate' orders-of-magnitude reduction in tool use with simultaneous accuracy gains, yet supplies no accuracy rates on trajectories, difficulty stratification, advantage variance within the conditioned subset, or ablation results. This leaves the central empirical claim unverifiable and does not address the risk that conditioning efficiency solely on accurate trajectories could introduce selection bias on low-complexity queries.

    Authors: The referee is correct that the abstract itself contains no numerical results or stratification details. The full experimental section reports aggregate accuracy improvements alongside tool-use reductions, with some difficulty-based breakdowns and ablations on the conditional mechanism. However, specific per-trajectory accuracy rates, advantage variance within the accurate subset, and explicit selection-bias diagnostics are not currently tabulated. Conditioning on accuracy does not inherently bias toward low-complexity queries because the accuracy channel remains unconditional and must be satisfied across the full distribution before efficiency refinement occurs; this curriculum structure encourages self-reliance on harder cases as well. To make the claims fully verifiable, we will expand the abstract with a brief summary of key metrics and add the requested stratification, variance statistics, and bias analysis to the experiments section in the revision. revision: yes
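The gating the simulated authors describe (efficiency advantages masked to zero on inaccurate trajectories, channels meeting only at the final loss) can be sketched as a REINFORCE-style objective. This is a hypothetical formulation, not the paper's implementation; the function name and the `w_tool` weight (echoing the wtool mentioned in Figure 4's caption) are assumptions.

```python
import numpy as np

def hdpo_loss(logp, adv_acc, adv_eff, w_tool=0.1):
    """Hypothetical sketch: combine the two channels only at the loss.

    logp:    log-probabilities of sampled trajectories under the policy.
    adv_acc: advantages from the accuracy channel (all trajectories).
    adv_eff: advantages from the efficiency channel, assumed already
             masked to zero on inaccurate trajectories, so this term
             never competes with learning to answer correctly.
    """
    logp = np.asarray(logp, dtype=float)
    loss_acc = -(np.asarray(adv_acc) * logp).mean()
    loss_eff = -(np.asarray(adv_eff) * logp).mean()
    # No reward scalarization: the two signals meet only here, as
    # weighted loss terms rather than a single blended reward.
    return loss_acc + w_tool * loss_eff
```

Because the efficiency term is a separate loss rather than a reward component, it is never folded into the advantage normalization that would otherwise drown a mild penalty.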

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces HDPO as a methodological reframing that separates accuracy maximization from conditional efficiency enforcement on accurate trajectories only. This is presented as an independent architectural choice whose benefits are demonstrated via evaluations on the resulting Metis model. No equations, derivations, or self-citations are shown that reduce the claimed orthogonality, cognitive curriculum, or performance gains to fitted parameters, prior self-work, or definitional tautologies. The central claims rest on empirical results rather than any load-bearing reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

From abstract alone, no explicit free parameters, axioms, or invented entities are introduced beyond standard RL assumptions of advantage estimation and trajectory sampling.

pith-pipeline@v0.9.0 · 5578 in / 1046 out tokens · 28164 ms · 2026-05-10T18:30:50.007002+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

cs.CL · 2026-05 · unverdicted · novelty 7.0

    TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

  2. Hallucinations Undermine Trust; Metacognition is a Way Forward

cs.CL · 2026-05 · unverdicted · novelty 6.0

    LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.

Reference graph

Works this paper leans on

62 extracted references · 42 canonical work pages · cited by 2 Pith papers · 18 internal anchors

  1. [1]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Sft or rl? an early investigation into training r1-like reasoning large vision-language models

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models.arXiv preprint arXiv:2504.11468, 2025

  5. [5]

    Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

    Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  7. [7]

    Gemini 3.1 pro: A smarter model for your most complex tasks, 2026

    Google. Gemini 3.1 pro: A smarter model for your most complex tasks, 2026. Accessed: 2026-02-19

  8. [8]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  9. [9]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  10. [10]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

  11. [11]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

  12. [12]

    Mini-o3: Scaling up reasoning patterns and interaction turns for visual search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

  13. [13]

    Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning.arXiv preprint arXiv:2509.24251, 2025

  14. [14]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  15. [15]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.arXiv preprint arXiv:2404.16790, 2024

  16. [16]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  17. [17]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  18. [18]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025

  19. [19]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

  20. [20]

    Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning.CoRR, 2025

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning.CoRR, 2025

  21. [21]

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

  22. [22]

    V-thinker: Interactive thinking with images.arXiv preprint arXiv:2511.04460, 2025

    Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, et al. V-thinker: Interactive thinking with images.arXiv preprint arXiv:2511.04460, 2025

  23. [23]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  25. [25]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  26. [26]

    Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration

    Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6613–6629, 2025

  27. [27]

    CodeDance: A Dynamic Tool-Integrated MLLM for Executable Visual Reasoning

    Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, and Yunqing Zhao. Codedance: A dynamic tool-integrated mllm for executable visual reasoning.arXiv preprint arXiv:2512.17312, 2025

  28. [28]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  29. [29]

    Sketch-in-latents: Eliciting unified reasoning in mllms.CoRR, abs/2512.16584, 2025

    Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, and Ruixuan Li. Sketch-in-latents: Eliciting unified reasoning in mllms.arXiv preprint arXiv:2512.16584, 2025

  30. [30]

    Emosync: Multi-stage reasoning with multimodal large language models for fine-grained emotion recognition

    Jintao Tong, Shiwei Li, Zijian Zhuang, Jinghan Hu, and Yixiong Zou. Emosync: Multi-stage reasoning with multimodal large language models for fine-grained emotion recognition. InProceedings of the 3rd International Workshop on Multimodal and Responsible Affective Computing, pages 95–99, 2025

  31. [31]

    Swimbird: Eliciting switchable reasoning mode in hybrid autoregressive mllms.arXiv preprint arXiv:2602.06040, 2026

    Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, and Yixiong Zou. Swimbird: Eliciting switchable reasoning mode in hybrid autoregressive mllms.arXiv preprint arXiv:2602.06040, 2026

  32. [32]

    AdaTooler-V: Adaptive Tool-Use for Images and Videos

    Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, et al. Adatooler-v: Adaptive tool-use for images and videos. arXiv preprint arXiv:2512.16918, 2025

  33. [33]

    Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025

  34. [34]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

  35. [35]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  36. [36]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  37. [37]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907–7915, 2025

  38. [38]

    Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

  39. [39]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37:113569–113697, 2024

  40. [40]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  41. [41]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

  42. [42]

    V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  43. [43]

    LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

  44. [44]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025

  45. [45]

    Crosslmm: Decoupling long video sequences from lmms via dual cross-attention mechanisms

    Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, and Ray Zhang. Crosslmm: Decoupling long video sequences from lmms via dual cross-attention mechanisms. arXiv preprint arXiv:2505.17020, 2025

  46. [46]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022

  47. [47]

    Perception-r1: Pioneering perception policy with reinforcement learning

    En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954, 2025

  48. [48]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  49. [49]

    Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe

    Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, and Lidong Bing. Openmmreasoner: Pushing the frontiers for multimodal reasoning with an open and general recipe. arXiv preprint arXiv:2511.16334, 2025

  50. [50]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024

  51. [51]

    Thyme: Think beyond images

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025

  52. [52]

    Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

  53. [53]

    Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch

    Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, et al. Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch. arXiv preprint arXiv:2512.02395, 2025

  54. [54]

    Think3d: Thinking with space for spatial reasoning

    Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, et al. Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029, 2026

  55. [55]

    Pyvision: Agentic vision with dynamic tooling

    Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling. arXiv preprint arXiv:2507.07998, 2025

  56. [56]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  57. [57]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  58. [58]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024

  Appendix A: System Prompt

    The system prompt is presented in Figure 6. It explicitly defines the available tool, its calling format,...

    Figure 6 excerpt (system prompt): "Be decisive: Once you are confident in your answer, provide it immediately. Verification is worthwhile only when you are genuinely uncertain about a critical detail, not as a routine step." The prompt then fixes the output format: every turn must start with <reason>...</reason>, followed by one of two options. Option 1 is a tool call, whose reasoning block explains why the call is necessary and what the model expects to learn; the alternative is a final answer.

    Figure 8 (qualitative example, Metis): given image-search results for a Marc Chagall painting ("Die Bucht von Nizza (the bay of Nice) by Marc Chagall (1962)", plus auction listings for signed and attributed Chagall lithographs), the model concludes: <reason>The image search results suggest that the painting is titled "Die Bucht von Nizza" by Marc Chagall and was completed in 1962. The information is consistent across multiple search results, indicating that the year of completion is 1962.</reason><answer>1962</answer>
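    The trace above follows a simple tagged format: a <reason>...</reason> block, then either a tool call or a final <answer>...</answer>. A minimal sketch of how such a turn might be parsed, assuming only those two tag names (taken from the figures); the parser itself is illustrative and not from the paper:

```python
import re

def parse_turn(text: str) -> dict:
    """Extract the <reason> block and, if present, the final <answer>
    from one model turn. A turn with a reason but no answer is treated
    as a tool-call turn. Illustrative sketch only: the tag names follow
    the paper's figures, but this parser is not the authors' code."""
    reason = re.search(r"<reason>(.*?)</reason>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reason": reason.group(1).strip() if reason else None,
        "answer": answer.group(1).strip() if answer else None,
    }

turn = ('<reason>The search results consistently date the painting '
        'to 1962.</reason><answer>1962</answer>')
parsed = parse_turn(turn)
# parsed["answer"] == "1962"; a tool-call turn would yield answer=None
```

    Counting turns where `answer` is produced without any intervening tool call is one simple way to measure the "tool-free when vision suffices" behavior the review describes.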