pith. machine review for the scientific record.

arxiv: 2604.12896 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.LG

Recognition: unknown

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:26 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Perception Programs · multimodal language models · visual tool reasoning · language summaries · BLINK benchmark · training-free method · vision tools · perception tasks

The pith

Converting vision tool outputs into language summaries unlocks accurate visual reasoning in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal language models receive outputs from vision tools such as depth or flow estimators, yet often fail to use them because raw pixel data clashes with their language-based reasoning. The paper identifies the representation of those tool results, rather than model size or number of tool calls, as the core bottleneck. Perception Programs rewrite dense visual outputs into short structured text summaries that the models can parse and reason over directly. This change alone raises accuracy from 41 percent to 86 percent on multi-view reasoning and delivers a 22 percent average gain across BLINK tasks, including on smaller models and without any training. The result matters because it shows a simple way to extract more value from existing vision tools: align their outputs with how the models already think.

Core claim

The paper claims that the bottleneck in visual tool reasoning for MLLMs is the pixel-level representation of tool outputs, which is misaligned with language-native strengths. By introducing Perception Programs that convert these outputs into compact language summaries, models can effectively parse and reason over the visual cues, achieving substantial accuracy improvements across six perception-centric tasks without training or model changes.

What carries the argument

Perception Programs (P²), a method that rewrites dense tool outputs such as depth maps and optical flow into compact, structured language-native summaries that MLLMs can directly use for reasoning.
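The paper's own P² templates are not reproduced on this page, but the idea is concrete enough to sketch. The snippet below is a hypothetical illustration, not the authors' code; every function name and label choice is ours. It coarsens a dense depth map into a small grid of near/mid/far labels and emits the kind of short, language-native summary an MLLM can parse directly (Figure 3 suggests each tool gets its own such instantiation).

```python
# Hypothetical sketch of a depth-style Perception Program (not the paper's
# own template): coarsen a dense depth map into a small grid of
# near/mid/far labels and emit a compact language summary.
import numpy as np

def depth_to_summary(depth: np.ndarray, grid: int = 3) -> str:
    """Summarize an HxW depth map (smaller values = nearer) as labeled cells."""
    h, w = depth.shape
    lo, hi = np.percentile(depth, [5, 95])        # robust depth range
    edges = np.linspace(lo, hi, 4)                # 3 buckets: near/mid/far
    labels = ["near", "mid", "far"]
    rows = []
    for i in range(grid):
        cells = []
        for j in range(grid):
            cell = depth[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            k = int(np.clip(np.digitize(cell.mean(), edges) - 1, 0, 2))
            cells.append(labels[k])
        rows.append(" | ".join(cells))
    return "Relative depth, top-left to bottom-right:\n" + "\n".join(rows)

# Example: a synthetic ramp, near at the bottom and far at the top.
demo = np.tile(np.linspace(5.0, 1.0, 90)[:, None], (1, 90))
print(depth_to_summary(demo))
```

The design choice this illustrates is the one the pith highlights: the summary discards almost all pixel detail and keeps only the ordinal structure the downstream question needs.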

Load-bearing premise

The compact language summaries produced by Perception Programs preserve all task-critical visual information from the original tool outputs without introducing systematic errors or omissions.

What would settle it

A controlled test on a perception task where a single critical detail is lost in the textual summary but remains visible in the raw tool output, and P² accuracy falls below the raw-pixel baseline.
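A minimal harness for that test could look like the sketch below; every name in it (run_mllm, ablate_detail, the dataset fields) is a stand-in rather than the paper's API. It masks one task-critical field from each P² summary and checks whether accuracy falls below the raw-tool baseline, which is the failure direction the premise above rules out.

```python
# Hypothetical ablation harness for the falsification test described above.
# run_mllm, ablate_detail, and the example fields are stand-ins, not the
# paper's interface.
from typing import Callable, Iterable

def ablation_test(dataset: Iterable[dict],
                  run_mllm: Callable[[dict, str], str],
                  ablate_detail: Callable[[str], str]) -> dict:
    """Accuracy of intact P2, detail-ablated P2, and the raw tool output."""
    hits = {"p2": 0, "p2_ablated": 0, "raw_tool": 0}
    n = 0
    for ex in dataset:  # ex: {"summary": str, "raw": str, "answer": str}
        n += 1
        hits["p2"] += run_mllm(ex, ex["summary"]) == ex["answer"]
        hits["p2_ablated"] += run_mllm(ex, ablate_detail(ex["summary"])) == ex["answer"]
        hits["raw_tool"] += run_mllm(ex, ex["raw"]) == ex["answer"]
    return {k: v / n for k, v in hits.items()}

# The premise fails if results["p2_ablated"] < results["raw_tool"] on items
# whose answer hinges on the ablated detail.
```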

Figures

Figures reproduced from arXiv: 2604.12896 by Bahador Rashidi, Di Niu, Hugo Silva, Muhammad Kamran Janjua.

Figure 1
Figure 1: Teaser. Turning dense tool outputs into a Perception Program makes a general MLLM behave as if it can read the modality. Given the same query and input pair, (a) a standard MLLM underuses the visual signal [6], (b) a tool-only route exposes the modality but stays pixel-level, while (c) our P² summarizes it into a language-native structure that the MLLM can reliably reason over, yielding large gains. view at source ↗
Figure 2
Figure 2: Under-utilization of Visual Information. Given several ICL examples along with a depth map, GPT-5 Mini fails to recover near-to-far ordering from it (see Sec. 5.1), indicating limited utilization of the modality. view at source ↗
Figure 3
Figure 3: Perception Program Instantiations. Top: tool outputs. Bottom: P² instantiations of those respective tools. view at source ↗
Figure 4
Figure 4: Mean ∆ vs. prior SOTA across BLINK. Bars show average accuracy improvement (percentage points) of each method over the task-wise (except HardBLINK) prior state-of-the-art (at point zero; see Tab. 1). Positive values indicate gains over prior SOTA; negative values indicate regressions. Numeric ∆ are written inside/beyond the bars along with their method names. VS denotes Visual Sketchpad. view at source ↗
Figure 5
Figure 5: GPT-5 Depth Modality Analysis. Left: Kendall's tau (y-axis) between ground-truth and GPT-5 Mini reconstructed P² decreases as the grid is refined (x-axis). Right: HardBLINK-5 accuracy (y-axis) using GPT-5 Mini's reconstructions (GPT Recon.) across grids (x-axis). view at source ↗
Figure 7
Figure 7: Average Tokens/Sample. Comparison of Visual Sketchpad (with GPT-5 Mini as LLM) and GPT-5 Mini with P² on average tokens per sample across all six sub-tasks. P² incurs significantly lower token cost. view at source ↗
Figure 8
Figure 8: Open-Source Prompt with P² ICL. We present a sample prompt for open-source MLLMs (e.g., Qwen3VL and InternVL3.5). We include a single in-context example describing the use of P². Both Qwen3VL and InternVL3.5 reason with the given P² to compute the correct answer (A) to the question. view at source ↗
Figure 9
Figure 9: Open-Source Prompt with Tool ICL. We present a sample prompt for open-source MLLMs (e.g., Qwen3VL and InternVL3.5). We include a single in-context example describing the use of optical flow as tool output. Note how the example clearly illustrates that blue hues indicate left while warm hues indicate right motion; the MLLM (Qwen3VL in this example) concludes the same, that flow is dominated by blue hues, yet… view at source ↗
Figure 10
Figure 10: Correspondence Distribution. Illustration of the distribution of correspondence markers in the visual correspondence task from the BLINK validation set. view at source ↗
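The body text adjoining Figure 10 describes a naive correspondence oracle: for each candidate marker, compute its Euclidean distance to the reference point, pick the nearest neighbor, and take that neighbor's coordinate in the second image as the mapped position to compare with ground truth. A minimal sketch under assumed data structures (the field names are ours, not the authors'):

```python
# Sketch of the naive nearest-neighbor correspondence oracle described in
# the paper's supplementary text; the data layout here is assumed, not the
# authors' format.
import math

def nearest_candidate(reference, candidates):
    """reference: (x, y) in image 1. candidates: dicts with 'xy' in image 1
    and 'mapped' in image 2; returns the mapped location of the nearest one."""
    best = min(candidates, key=lambda c: math.dist(reference, c["xy"]))
    return best["mapped"]

print(nearest_candidate((10, 12),
                        [{"xy": (9, 13), "mapped": (40, 41)},
                         {"xy": (30, 2), "mapped": (70, 5)}]))  # -> (40, 41)
```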
read the original abstract

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs fail to benefit from vision tool outputs (depth, flow, correspondence) when these are fed as raw pixels, because such representations are misaligned with language-native reasoning; instead, Perception Programs (P²) convert tool outputs into compact structured language summaries, yielding large gains on BLINK tasks (e.g., +45 points on multi-view reasoning and +29 points on relative depth with GPT-5 Mini) without training or model changes, outperforming agentic, supervised, and RL baselines.

Significance. If the empirical results hold after addressing the representation-faithfulness concern, the work would demonstrate that output representation—not additional tools, scale, or training—is the primary bottleneck for tool-augmented visual reasoning in MLLMs. The training-free, model-agnostic nature and consistent gains across model sizes (including 4B-scale MLLMs) would be a notable practical contribution, shifting focus from complex agent loops to simpler cue reformatting.

major comments (2)
  1. [Method and Experiments] The central claim that language summaries unlock visual reasoning rests on the untested assumption that they preserve all task-critical information from raw tool outputs. The paper does not report any quantitative comparison (e.g., information loss metrics or human verification) between the original pixel/tool data and the generated summaries on the relative-depth or multi-view tasks, leaving open the possibility that gains arise from noise reduction rather than faithful cue encoding.
  2. [Experiments] Table 1 (or equivalent results table) reports 22% average gain and SOTA numbers, but lacks error bars, statistical significance tests, or details on how many runs were averaged; given the headline deltas (e.g., 41.35% → 86.47%), this weakens confidence that the improvements are robust rather than sensitive to prompt or summarizer variance.
minor comments (2)
  1. [Abstract] The abstract and introduction use “GPT-5 Mini” without clarifying whether this is a hypothetical or specific released model; add a footnote or citation for reproducibility.
  2. [Method] Clarify the exact template or prompting strategy used to generate the structured summaries (e.g., fixed code vs. LLM calls) in the method section, as this affects claims of being fully training-free and model-agnostic.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify key areas for strengthening our claims and evaluation. We respond point-by-point to the major comments below and outline revisions to address them directly.

read point-by-point responses
  1. Referee: [Method and Experiments] The central claim that language summaries unlock visual reasoning rests on the untested assumption that they preserve all task-critical information from raw tool outputs. The paper does not report any quantitative comparison (e.g., information loss metrics or human verification) between the original pixel/tool data and the generated summaries on the relative-depth or multi-view tasks, leaving open the possibility that gains arise from noise reduction rather than faithful cue encoding.

    Authors: We appreciate this observation on the need for explicit validation of information preservation. Our Perception Programs are constructed to extract and verbalize only the task-relevant cues (e.g., explicit relative depth orderings or correspondence relations) while discarding extraneous pixel details, which aligns with the observed large gains that would be unlikely from noise reduction alone. That said, we did not include direct quantitative information-loss metrics or human verification in the original submission. In the revision we will add a dedicated analysis subsection that reports human-rated faithfulness scores on sampled outputs for the relative-depth and multi-view tasks, together with a comparison of task-critical elements retained versus discarded from the raw tool data. revision: yes
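Figure 5 already points at the natural metric for this revision: Kendall's tau between ground-truth orderings and orderings recovered from the summaries. A sketch of such a faithfulness check, assuming a hypothetical upstream step has parsed near-to-far rankings out of the P² text:

```python
# Sketch of a summary-faithfulness check in the spirit of Figure 5.
# gt_orderings and summary_orderings are parallel lists of rankings; how
# rankings are parsed from P2 text is assumed, not specified by the paper.
from scipy.stats import kendalltau

def mean_faithfulness(gt_orderings, summary_orderings):
    """Mean Kendall's tau between ground-truth and summary-derived ranks."""
    taus = []
    for gt, pred in zip(gt_orderings, summary_orderings):
        tau, _ = kendalltau(gt, pred)  # tau = 1.0 means perfect agreement
        taus.append(tau)
    return sum(taus) / len(taus)

# A single swapped pair scores below a perfect reconstruction.
print(mean_faithfulness([[0, 1, 2, 3]], [[0, 1, 2, 3]]))  # -> 1.0
print(mean_faithfulness([[0, 1, 2, 3]], [[1, 0, 2, 3]]))  # -> ~0.67
```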

  2. Referee: [Experiments] Table 1 (or equivalent results table) reports 22% average gain and SOTA numbers, but lacks error bars, statistical significance tests, or details on how many runs were averaged; given the headline deltas (e.g., 41.35% → 86.47%), this weakens confidence that the improvements are robust rather than sensitive to prompt or summarizer variance.

    Authors: We agree that error bars, statistical significance testing, and explicit details on run averaging are necessary to demonstrate robustness, especially given potential variance from the summarization step. The original results were obtained from single runs per configuration. In the revised manuscript we will augment Table 1 (and all main result tables) with standard deviations computed over five independent runs that vary the summarizer prompt phrasing and random seeds where applicable. We will also report the results of paired statistical significance tests (e.g., McNemar’s test) between P² and the raw-tool baselines, and we will clarify the exact averaging procedure in the experimental section. revision: yes
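For the paired test named in this response, the exact (binomial) form of McNemar's test reduces to a few lines over per-item correctness vectors. The sketch below uses illustrative numbers, not results from the paper, and assumes at least one discordant pair:

```python
# Exact McNemar's test on paired per-item correctness (illustrative data).
# b and c count the discordant pairs; under H0 they split 50/50.
from scipy.stats import binomtest

def mcnemar_exact(p2_correct, raw_correct) -> float:
    b = sum(1 for p, r in zip(p2_correct, raw_correct) if p and not r)
    c = sum(1 for p, r in zip(p2_correct, raw_correct) if r and not p)
    return binomtest(b, b + c, 0.5).pvalue  # requires b + c >= 1

p2  = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # made-up per-item correctness
raw = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
print(mcnemar_exact(p2, raw))  # -> 0.125 with this toy data (two-sided)
```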

Circularity Check

0 steps flagged

No circularity: empirical gains from Perception Programs rest on external benchmarks

full rationale

The paper introduces Perception Programs as a training-free conversion of tool outputs (depth maps, flow, etc.) into compact language summaries, then reports accuracy lifts on the BLINK benchmark suite against raw-tool and prior baselines. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation chain. All performance numbers are measured on held-out tasks and compared to independent methods, so the central claim remains externally falsifiable and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the empirical effectiveness of language summaries as a faithful proxy for pixel data. No free parameters, standard axioms, or invented physical entities are introduced beyond the method itself.

invented entities (1)
  • Perception Programs (P²) no independent evidence
    purpose: Rewrite tool outputs into compact language summaries
    New method proposed in the paper to address representation misalignment.

pith-pipeline@v0.9.0 · 5616 in / 1280 out tokens · 64506 ms · 2026-05-10T15:26:00.278019+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    Perception tokens enhance visual reasoning in multimodal language models

    Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025.

  2. [2]

    PerceptionLM: Open-access data and models for detailed visual understanding

    Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. PerceptionLM: Open-access data and models for detailed visual understanding. arXiv preprint arXiv:2504.13180, 2025.

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  4. [4]

    MMFactory: A universal solution search engine for vision-language tasks

    Wan-Cyuan Fan, Tanzila Rahman, and Leonid Sigal. MMFactory: A universal solution search engine for vision-language tasks. arXiv preprint arXiv:2412.18072, 2024.

  5. [5]

    GRIT: Teaching MLLMs to Think with Images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. GRIT: Teaching MLLMs to think with images. arXiv preprint arXiv:2505.15879, 2025.

  6. [6]

    Hidden in plain sight: VLMs overlook their visual representations

    Stephanie Fu, Tyler Bonnen, Devin Guillory, and Trevor Darrell. Hidden in plain sight: VLMs overlook their visual representations, 2025.

  7. [7]

    LLMDet: Learning strong open-vocabulary object detectors under the supervision of large language models

    Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. LLMDet: Learning strong open-vocabulary object detectors under the supervision of large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14987–14997, 2025.

  8. [8]

    BLINK: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.

  9. [9]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.

  10. [10]

    Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024.

  11. [11]

    Zebra-CoT: A dataset for interleaved vision language reasoning

    Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-CoT: A dataset for interleaved vision language reasoning. arXiv preprint arXiv:2507.16746, 2025.

  12. [12]

    LLaVA-Plus: Learning to use tools for creating multimodal agents

    Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. LLaVA-Plus: Learning to use tools for creating multimodal agents. In European Conference on Computer Vision, pages 126–142. Springer, 2024.

  13. [13]

    Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  14. [14]

    LATTE: Learning to think with vision specialists

    Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, and Silvio Savarese. LATTE: Learning to think with vision specialists. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2025.

  15. [15]

    GPT-5 system card

    OpenAI. GPT-5 system card, 2025.

  16. [16]

    Perception Test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception Test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

  17. [17]

    Grounded reinforcement learning for visual reasoning

    Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. arXiv preprint arXiv:2505.23678, 2025.

  18. [18]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.

  19. [19]

    ViperGPT: Visual inference via Python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.

  20. [20]

    Emergent correspondence from image diffusion

    Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36:1363–1389, 2023.

  21. [21]

    TULIP: Contrastive image-text learning with richer vision understanding

    Zineng Tang, Long Lian, Seun Eisape, Xudong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, and David M Chan. TULIP: Contrastive image-text learning with richer vision understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4267–4277, 2025.

  22. [22]

    Qwen3 technical report

    Qwen Team. Qwen3 technical report, 2025.

  23. [23]

    RAFT: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision, pages 402–419. Springer, 2020.

  24. [24]

    Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

  25. [25]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  26. [26]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  27. [27]

    Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning

    Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, et al. Open vision reasoner: Transferring linguistic cognitive behavior for visual reasoning. arXiv preprint arXiv:2507.05255, 2025.

  28. [28]

    Depth Anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024.

  29. [29]

    Machine mental imagery: Empower multimodal reasoning with latent visual tokens

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218, 2025.

  30. [30]

    Introducing visual perception token into multimodal large language model

    Runpeng Yu, Xinyin Ma, and Xinchao Wang. Introducing visual perception token into multimodal large language model. arXiv preprint arXiv:2502.17425, 2025.

  31. [31]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.

  32. [32]

    Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

    Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, and Yanbin Hao. Improving the reasoning of multi-image grounding in MLLMs via reinforcement learning. arXiv preprint arXiv:2507.00748, 2025.

  33. [33]

    Thyme: Think Beyond Images

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025.

  34. [34]

    VipAct: Visual-perception enhancement via specialized VLM agent collaboration and tool-use

    Zhehao Zhang, Ryan Rossi, Tong Yu, Franck Dernoncourt, Ruiyi Zhang, Jiuxiang Gu, Sungchul Kim, Xiang Chen, Zichao Wang, and Nedim Lipka. VipAct: Visual-perception enhancement via specialized VLM agent collaboration and tool-use. arXiv preprint arXiv:2410.16400, 2024.

  35. [35]

    Reinforced visual perception with tools

    Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. Reinforced visual perception with tools. arXiv preprint arXiv:2509.01656, 2025.
