pith. machine review for the scientific record.

arxiv: 2604.24339 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Recognition: unknown

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

Naiming Liu, Shuning Wang, Tong Wang, Yumeng Zhang, Zhiheng Wu

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual · low-level · reasoning · feedback · models · answer · cues

The pith

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models read images and text together but often overlook fine details like textures or object boundaries and rarely double-check their own conclusions. ForeSight adds simple visual tools that extract low-level cues such as edges or regions and inserts them into the model's thinking chain. It also creates a mask-based feedback loop so the model can re-examine parts of the image and revise its answer. Reinforcement learning uses final answer correctness as the reward, letting the model learn when to call tools or reflect. The authors test this on a new dataset built from saliency benchmarks that focuses on character recognition and grounding.
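A minimal sketch of the loop described above, not the authors' implementation: the model may request a low-level visual tool, the tool's output is fed back into its context, and after a draft answer a masked region is re-examined before the answer is finalized. The tool tags, the `vlm.generate` interface, and every helper below are hypothetical.

```python
import re

# Hypothetical tool tags; the paper's low-level tools are Canny, Zoom-In, and Color.
TOOL_TAGS = ("<CANNY>", "<ROI>", "<COLOR>")


def run_tool(tag, image, region=None):
    """Placeholder for a low-level visual tool (edge map, zoomed crop, color map)."""
    return {"tool": tag, "image": image, "region": region}  # stand-in visual output


def parse_answer(text):
    """Pull a draft answer out of an <answer>...</answer> span, if present."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return m.group(1).strip() if m else None


def rollout(vlm, image, question, max_steps=4):
    """One interleaved trajectory: tool calls, draft answer, mask-based reflection."""
    context = [("image", image), ("text", question)]
    draft = None
    for _ in range(max_steps):
        step = vlm.generate(context)          # assumed interface of the passed-in model
        context.append(("text", step))
        requested = [t for t in TOOL_TAGS if t in step]
        if requested:                         # "see further": inject low-level visual cues
            context.extend(("image", run_tool(t, image)) for t in requested)
            continue
        draft = parse_answer(step)
        if draft is not None:
            break
    if draft is None:
        return None
    # "think deeper": mask-based visual feedback on the drafted answer, then reflect
    context.append(("image", run_tool("<ROI>", image, region=draft)))
    context.append(("text", "Re-examine the highlighted region and revise if needed."))
    return parse_answer(vlm.generate(context)) or draft
```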

Core claim

Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.

Load-bearing premise

The new CG-SalBench dataset, constructed from SalBench, provides an unbiased and comprehensive test of the claimed reasoning improvements without construction artifacts that favor the proposed method.

Figures

Figures reproduced from arXiv: 2604.24339 by Naiming Liu, Shuning Wang, Tong Wang, Yumeng Zhang, Zhiheng Wu.

Figure 1
Figure 1: An example of ForeSight's inference process. Our model dynamically invokes relevant tools during reasoning and integrates their outputs for deeper inference. After generating an initial draft answer, the model produces visual feedback based on this answer and conducts further reflection to verify the correctness of the answer.
Figure 2
Figure 2: The illustration of CG-SalBench. Each sample contains multiple similar distractors and one distinct target, which differs in one or more specific visual characters. The task is to output the bounding box (bbox) and character of a specific object. The left image is a natural image (O3), and the right image is a synthetic image (P3).
Figure 3
Figure 3: Overview of ForeSight's two-stage framework. Stage 1: We use high-quality tool calls and visual reflection data for supervised fine-tuning to teach the model how to utilize external tools and generate coherent inference trajectories effectively. Stage 2: We further apply reinforcement learning and the GRPO algorithm, combined with tool-based rewards and several other complementary rewards, to achieve end-t…
Figure 4
Figure 4: Visualization of images using visual tools. O3 represents a natural image, and P3 represents synthetic data.
Figure 5
Figure 5: V²-CoT Data Construction Process. All VLMs are based on the Qwen2.5-VL series. The raw dataset is split into the tool-free subset and the tool-dependent subset according to performance differences between the 7B and 32B models. The 72B model generates structured CoT annotations with visual cues and reflection for both types, and rule-based cleaning yields a balanced, high-quality dataset for SFT and RL training.
Figure 6
Figure 6: An example of data annotation.
Figure 8
Figure 8: An example of using the Zoom-In tool in a natural image.
Figure 10
Figure 10: An example of using all tools in a natural image.
Figure 12
Figure 12: An example of using the Color tool in a synthetic image.
Figure 14
Figure 14: An example of visual reflection in a natural image.
Figure 15
Figure 15: An example of visual reflection in a synthetic image.
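Figure 2 above frames the CG-SalBench task as predicting a bounding box plus the visual character that distinguishes the odd-one-out target. A minimal, illustrative scorer for that output format is sketched below; the (x1, y1, x2, y2) box convention, the 0.5 IoU threshold, and exact string matching on the character are assumptions, not the paper's reported metrics.

```python
from dataclasses import dataclass

@dataclass
class CGSalSample:
    image_path: str
    target_bbox: tuple       # assumed (x1, y1, x2, y2) convention
    target_character: str    # the distinguishing visual attribute, e.g. "color"

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def score(pred_bbox, pred_character, sample, iou_threshold=0.5):
    """Return (grounding correct, character correct) under the assumed threshold."""
    box_ok = iou(pred_bbox, sample.target_bbox) >= iou_threshold
    char_ok = pred_character.strip().lower() == sample.target_character.strip().lower()
    return box_ok, char_ok

# Example: score((10, 10, 50, 50), "color", CGSalSample("img.png", (12, 9, 52, 48), "color"))
```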
Original abstract

Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework, ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
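The abstract specifies final answer accuracy as the RL reward, and Figure 3 mentions GRPO with tool-based and complementary rewards. Below is a minimal sketch of how such a reward and a GRPO-style group-relative advantage could be computed; the auxiliary bonuses, their weights, and the normalization details are assumptions rather than the paper's values.

```python
import statistics

def trajectory_reward(pred, gold, used_tool, well_formatted,
                      tool_bonus=0.1, format_bonus=0.1):
    """Accuracy-dominated reward with small auxiliary terms (weights assumed)."""
    r = 1.0 if pred is not None and pred == gold else 0.0
    r += tool_bonus if used_tool else 0.0
    r += format_bonus if well_formatted else 0.0
    return r

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mean) / std for r in rewards]

# Example: four sampled trajectories for the same prompt
rewards = [trajectory_reward(p, "cube", t, f)
           for p, t, f in [("cube", True, True), ("ball", True, True),
                           ("cube", False, True), (None, False, False)]]
advantages = group_relative_advantages(rewards)
```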

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions that low-level visual cues and mask-based reflection will improve VLM reasoning when optimized by RL; no explicit free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption: Low-level visual information can be effectively integrated into VLM reasoning chains via dedicated tools.
    Invoked in the first component of the framework to address the neglect of fine-grained features.
  • domain assumption: Mask-based visual feedback enables dynamic re-examination and answer updating in the reasoning process.
    Core premise of the second component for incorporating visual reflection.
invented entities (1)
  • ForeSight framework (no independent evidence)
    purpose: Unified multimodal interleaved reasoning with tool invocation and answer verification.
    New proposed system combining visual tools and reflection under RL; no independent evidence is provided beyond the paper's own experiments.

pith-pipeline@v0.9.0 · 5542 in / 1384 out tokens · 35388 ms · 2026-05-08T04:45:17.810245+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

95 extracted references · 39 canonical work pages · 20 internal anchors

  1. [1]

    Introducing Claude 4

    Anthropic. Introducing Claude 4. https://www.anthropic.com/news/claude-4, 2025. Accessed 2025-07-27.

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  3. [3]

    Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  5. [5]

    Seed-1.6

    ByteDance. Seed-1.6. https://seed.bytedance.com/zh/seed1_6, 2024. Accessed 2025-07-27.

  6. [6]

    COCO-Stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018.

  7. [7]

    A survey of computer vision detection, visual SLAM algorithms, and their applications in energy-efficient autonomous systems

    Lu Chen, Gun Li, Weisi Xie, Jie Tan, Yang Li, Junfeng Pu, Lizhu Chen, Decheng Gan, and Weimin Shi. A survey of computer vision detection, visual SLAM algorithms, and their applications in energy-efficient autonomous systems. Energies, 17(20):5177, 2024.

  8. [8]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In ICCV, pages 24185–24198, 2024.

  9. [9]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.

  10. [10]

    A review of the F-measure: its history, properties, criticism, and alternatives

    Peter Christen, David J Hand, and Nishadi Kirielle. A review of the F-measure: its history, properties, criticism, and alternatives. ACM Computing Surveys, 56(3):1–24, 2023.

  11. [11]

    Deep reinforcement learning for mention-ranking coreference models

    Kevin Clark and Christopher D Manning. Deep reinforcement learning for mention-ranking coreference models. arXiv preprint arXiv:1609.08667, 2016.

  12. [12]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, pages 49250–49267, 2023.

  13. [13]

    MathSensei: A tool-augmented large language model for mathematical reasoning

    Debrup Das, Debopriyo Banerjee, Somak Aditya, and Ashish Kulkarni. MathSensei: A tool-augmented large language model for mathematical reasoning. In NAACL-HLT,

  14. [14]

    Virgo: A preliminary exploration on reproducing o1-like mllm

    Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Virgo: A preliminary exploration on reproducing o1-like MLLM. arXiv preprint arXiv:2501.01904, 2025.

  15. [15]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs. arXiv preprint arXiv:2504.11536, 2025.

  16. [16]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. In NIPS, 2025.

  17. [17]

    Interleaved-modal chain-of-thought

    Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. In CVPR, pages 19520–19529, 2025.

  18. [18]

    Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

    Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2(1):32, 2024.

  19. [19]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025.

  20. [20]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  21. [21]

    LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images

    Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In ECCV, pages 390–406. Springer, 2024.

  22. [22]

    Breaking the reasoning barrier: a survey on LLM complex reasoning through the lens of self-evolution

    Tao He, Hao Li, Jingchang Chen, Runxuan Liu, Yixin Cao, Lizi Liao, Zihao Zheng, Zheng Chu, Jiafeng Liang, Ming Liu, and Bing Qin. Breaking the reasoning barrier: a survey on LLM complex reasoning through the lens of self-evolution. In ACL, pages 7377–7417, Vienna, Austria, 2025. Association for Computational Linguistics.

  23. [23]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,

  24. [24]

    Vision-language models can’t see the obvious

    Ngoc Dung Huynh, Phuc H Le-Khac, Wamiq Reyaz Para, Ankit Singh, and Sanath Narayan. Vision-language models can’t see the obvious. In ICCV, pages 24159–24169, 2025.

  25. [25]

    VLM-R3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought

    Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. VLM-R3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought. arXiv preprint arXiv:2505.16192, 2025.

  26. [26]

    ReferItGame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.

  27. [27]

    Do saliency models detect odd-one-out targets? New datasets and evaluations

    Iuliia Kotseruba, Calden Wloka, Amir Rasouli, and John K Tsotsos. Do saliency models detect odd-one-out targets? New datasets and evaluations. arXiv preprint arXiv:2005.06583, 2020.

  28. [28]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.

  29. [29]

    Towards faithful chain-of-thought: Large language models are bridging reasoners

    Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Towards faithful chain-of-thought: Large language models are bridging reasoners. arXiv preprint arXiv:2405.18915, 2024.

  30. [30]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.

  31. [31]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.

  32. [32]

    Visual abstract thinking empowers multimodal reasoning

    Dairu Liu, Ziyue Wang, Minyuan Ruan, Fuwen Luo, Chi Chen, Peng Li, and Yang Liu. Visual abstract thinking empowers multimodal reasoning. arXiv preprint arXiv:2505.20164, 2025.

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.

  34. [34]

    LLaVA-Plus: Learning to use tools for creating multimodal agents

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. LLaVA-Plus: Learning to use tools for creating multimodal agents. In ECCV, pages 126–142. Springer, 2024.

  35. [35]

    MMBench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, pages 216–233. Springer, 2024.

  36. [36]

    Chain-of-action: Faithful and multimodal question answering through large language models

    Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. Chain-of-action: Faithful and multimodal question answering through large language models. In ICLR, 2025.

  37. [37]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.

  38. [38]

    LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025.

  39. [39]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.

  40. [40]

    AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving

    Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, et al. AgentThink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving. arXiv preprint arXiv:2505.15298, 2025.

  41. [41]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

  42. [42]

    Benchmarking object detectors with COCO: A new path forward

    Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with COCO: A new path forward. In ECCV, pages 279–295. Springer, 2024.

  43. [43]

    Dynamic rewarding with prompt optimization enables tuning-free self-alignment of language models

    Somanshu Singla, Zhen Wang, Tianyang Liu, Abdullah Ashfaq, Zhiting Hu, and Eric P Xing. Dynamic rewarding with prompt optimization enables tuning-free self-alignment of language models. arXiv preprint arXiv:2411.08733, 2024.

  44. [44]

    OpenThinkIMG: Learning to think with images via visual tool reinforcement learning

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.

  45. [45]

    Qwen3-VL: 4B & 8B dense models

    Alibaba Cloud (Tongyi Qianwen Team). Qwen3-VL: 4B & 8B dense models. http://www.shurl.cc/476cd8cde87202e5eb123ad4e32d5766, 2025. Accessed 2025-11-07.

  46. [46]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

  47. [47]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.

  48. [48]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.

  49. [49]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  50. [50]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  51. [51]

    Multimodal chain-of-thought reasoning: A comprehensive survey

    Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025.

  52. [52]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, pages 24824–24837, 2022.

  53. [53]

    AutoTIR: Autonomous tools integrated reasoning via reinforcement learning

    Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, and Li Du. AutoTIR: Autonomous tools integrated reasoning via reinforcement learning. arXiv preprint arXiv:2507.21836, 2025.

  54. [54]

    LLaVA-CoT: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In ICCV, pages 2087–2098,

  55. [55]

    Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319, 2024.

  56. [56]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv preprint arXiv:2408.01800, 2024.

  57. [57]

    UnitBox: An advanced object detection network

    Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. UnitBox: An advanced object detection network. In ACM MM, pages 516–520, 2016.

  58. [58]

    MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025.

  59. [59]

    GThinker: Towards general multimodal reasoning via cue-guided rethinking

    Yufei Zhan, Ziheng Wu, Yousong Zhu, Rongkun Xue, Ruipu Luo, Zhenghao Chen, Can Zhang, Yifan Li, Zhentao He, Zheming Yang, et al. GThinker: Towards general multimodal reasoning via cue-guided rethinking. arXiv preprint arXiv:2506.01078, 2025.

  60. [60]

    A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need?

    Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang, Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al. A complete survey on generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 all you need? arXiv preprint arXiv:2303.11717, 2023.

  61. [61]

    Improve vision language model chain-of-thought reasoning

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. arXiv preprint arXiv:2410.16198, 2024.

  62. [62]

    R1-Reward: Training multimodal reward model through stable reinforcement learning

    Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-Reward: Training multimodal reward model through stable reinforcement learning. arXiv preprint arXiv:2505.02835, 2025.

  63. [63]

    SWIFT: a scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. SWIFT: a scalable lightweight infrastructure for fine-tuning, 2024.

  64. [64]

    Knowledge augmented complex problem solving with large language models: A survey

    Da Zheng, Lun Du, Junwei Su, Yuchen Tian, Yuqi Zhu, Jintian Zhang, Lanning Wei, Ningyu Zhang, and Huajun Chen. Knowledge augmented complex problem solving with large language models: A survey. arXiv preprint arXiv:2505.03418, 2025.

  65. [65]

    DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking

    Weicheng Zheng, Xiaofei Mao, Nanfei Ye, Pengxiang Li, Kun Zhan, Xianpeng Lang, and Hang Zhao. DriveAgent-R1: Advancing VLM-based autonomous driving with hybrid thinking and active perception. arXiv preprint arXiv:2507.20879, 2025.

  66. [66]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

  67. [67]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

  68. [68]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

  69. [69]

    thinking with visual cues

    This guidance enables the model to appropriately invoke designated visual tools, interpret the generated visual feedback, and seamlessly connect these intermediate steps to derive the final answer. This mechanism fundamentally realizes the principle of “thinking with visual cues”. The Canny tool, Zoom-In tool, and Color tool are referred to as <CANNY>,...

  70. [70]

    Initial Observation: Describe the general content of the image

  71. [71]

    Tool Selection Reasoning: Explain why the first tool in the list was selected based on the initial observation

  72. [72]

    Tool Application Results: Describe the results from using the first tool

  73. [73]

    Subsequent Analysis: If multiple tools are listed, repeat steps 2 and 3 for each tool

  74. [74]

    Output Requirements:

    Conclusion: Based on the above findings, the target region and Target Character can be derived.

  75. [76]

    Relevance: Ignore irrelevant information and avoid unnecessary details

  76. [77]

    Format: Use natural and fluent language without timestamps or extra instructions

  77. [78]

    Output Structure: Clearly separate each tool’s description and analysis sequentially

  78. [79]

    Provide a complete explanation without timestamps or additional instructions

    Output Format: Use natural, fluent language. Provide a complete explanation without timestamps or additional instructions. Use ‘<>‘ to mark tools; choose from: <CANNY>, <ROI>, <COLOR>. The sentence structure should be: "Based on the available information, we analyzed that we need to use <Tool 1>, and then use <Tool 2> for further analysis. <Attribution analysis>..."

  79. [80]

    Do not include bounding boxes <Target Region> or Target Characters <Target Character> during the analysis; these are only shown in the final result

  80. [81]

    Descriptions of different tools should be separated and presented sequentially

Showing first 80 references.