pith. machine review for the scientific record.

arxiv: 2505.15436 · v3 · submitted 2025-05-21 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 05:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · chain-of-focus · multimodal reasoning · visual search · adaptive zooming · reinforcement learning · efficient inference

The pith

VLMs can reason more efficiently by adaptively searching and zooming into key image regions via Chain-of-Focus training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that vision-language models gain stronger multimodal reasoning by learning to focus and zoom on task-relevant image patches rather than processing entire high-resolution inputs uniformly. It does so through a two-stage process that first uses a visual agent to build a dataset of adaptive focus examples for supervised fine-tuning, then applies reinforcement learning to refine search and reasoning strategies using outcome rewards. A sympathetic reader would care because this approach could maintain or improve accuracy on visual tasks while reducing the computational cost of handling images at resolutions up to 4K.

Core claim

By constructing the MM-CoF dataset from a visual agent that identifies key regions for different resolutions and questions, fine-tuning Qwen2.5-VL on it, and then updating the model with reinforcement learning on accuracy and format rewards, the resulting system performs dynamic visual search and zooming that yields better results on visual reasoning benchmarks.
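The accuracy and format rewards described above can be sketched as a single outcome-reward function. This is a minimal illustration, not the paper's implementation: the tag format, the match criterion, and the 0.5 weighting are assumptions.

```python
import re

def outcome_reward(response: str, ground_truth: str) -> float:
    """Illustrative outcome reward combining a format check and an
    accuracy check; the paper's exact scheme may differ."""
    # Format reward: response should wrap reasoning and answer in tags.
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            response, re.DOTALL))
    format_reward = 0.5 if fmt_ok else 0.0

    # Accuracy reward: exact match of the extracted final answer.
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = m.group(1).strip() if m else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0

    return accuracy_reward + format_reward
```

A malformed response earns neither reward, which is what pushes the RL stage to keep the search-and-answer structure while optimizing for correctness.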

What carries the argument

The Chain-of-Focus (CoF) method, which lets the model adaptively identify and zoom into key image regions based on visual cues and the question.
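The adaptive focus-and-zoom behavior can be pictured as a short inference loop: at each step the model either answers or requests a crop of a key region, and only that region is re-encoded at the next step. A hedged sketch follows; the interface, box format, and step budget are illustrative assumptions, not the paper's API.

```python
def chain_of_focus(model, image, question, max_steps=3):
    """Sketch of a CoF-style loop: `model` returns either
    {"answer": ...} or {"zoom": (x0, y0, x1, y1)}; `image`
    only needs a PIL-like crop() method. All names are assumed."""
    view = image
    for _ in range(max_steps):
        out = model(view, question)
        if "answer" in out:
            return out["answer"]
        x0, y0, x1, y1 = out["zoom"]
        # Zoom: re-encode only the selected region at the next step.
        view = view.crop((x0, y0, x1, y1))
    # Step budget exhausted: force a final answer on the current view.
    return model(view, question).get("answer")
```

The efficiency claim hinges on this loop: the full image is encoded once at low cost, and high-resolution processing is spent only on the regions the model chooses to inspect.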

If this is right

  • Performance on the V* benchmark improves by 5 percent across eight image resolutions from 224 to 4K compared with prior VLMs.
  • Multimodal reasoning becomes possible without forcing the entire image through high-resolution processing at every step.
  • The two-stage pipeline of supervised fine-tuning followed by reinforcement learning refines the model's search strategy without additional human-designed priors.
  • Deployment of VLMs in practical settings becomes more efficient because only selected regions need detailed analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same focusing mechanism could be tested on video sequences to see whether frame-by-frame adaptive search reduces compute while preserving temporal reasoning.
  • If the learned zoom policy generalizes, it might combine with existing compression techniques to further lower memory use during inference.
  • The approach suggests a route for making attention mechanisms in VLMs more like selective human vision rather than uniform grid processing.

Load-bearing premise

The visual agent that generates the training examples consistently picks the right regions without introducing biases that would limit performance on real user questions or new image distributions.

What would settle it

A controlled test in which the model is evaluated on images where the visual agent demonstrably misses the task-critical area and shows a clear drop in accuracy relative to full-image baselines.
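The controlled test proposed above amounts to a simple stratified comparison. A sketch of the bookkeeping, assuming hypothetical per-item record fields (none of these names come from the paper):

```python
def focus_failure_probe(records):
    """Compare CoF vs full-image accuracy on items where the region
    proposer missed the task-critical area. Record fields
    (`agent_box_hits_target`, `cof_correct`, `full_image_correct`)
    are hypothetical labels from a controlled evaluation."""
    missed = [r for r in records if not r["agent_box_hits_target"]]
    if not missed:
        return None
    cof_acc = sum(r["cof_correct"] for r in missed) / len(missed)
    full_acc = sum(r["full_image_correct"] for r in missed) / len(missed)
    return {"n": len(missed), "cof_acc": cof_acc,
            "full_acc": full_acc, "gap": full_acc - cof_acc}
```

A large positive `gap` on the missed subset would confirm the load-bearing premise is doing real work; a near-zero gap would suggest the model recovers even when its initial focus is wrong.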

read the original abstract

Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, enabling further refining the search and reasoning strategy of models without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% among 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Chain-of-Focus (CoF), an adaptive visual search and zooming mechanism for VLMs to enable efficient multimodal reasoning. It introduces a two-stage training pipeline: supervised fine-tuning (SFT) on the 3K-sample MM-CoF dataset generated by an external visual agent that identifies task-relevant regions across varying resolutions and questions, followed by reinforcement learning (RL) using outcome accuracy and format rewards to refine the Qwen2.5-VL base model. The resulting model is reported to outperform existing VLMs by 5% on the V* benchmark across eight image resolutions ranging from 224 to 4K.

Significance. If the performance gains can be attributed specifically to the CoF mechanism rather than dataset construction artifacts, the approach could support more compute-efficient VLM inference on high-resolution inputs by dynamically focusing computation on relevant regions. The combination of SFT for cold-start initialization and RL for strategy refinement follows established patterns in reasoning model training and may generalize to other visual grounding tasks.

major comments (2)
  1. [§3.2] §3.2 (Dataset Construction): The MM-CoF dataset labels are produced by an external visual agent, yet no ablation is presented that substitutes human region annotations or a deliberately mismatched agent. This omission is load-bearing for the central claim, as the reported 5% V* gain occurs after SFT on this dataset; without such controls it is impossible to separate the contribution of CoF reasoning from any systematic region-selection biases (e.g., saliency heuristics or resolution-dependent cropping) embedded in the agent's policy.
  2. [§4] §4 (Experiments): The abstract and results claim a 5% improvement on V* across 224–4K resolutions, but the manuscript supplies no details on the exact baselines compared, statistical significance tests, error bars, or controls for confounding factors such as total training compute or model size. These omissions prevent verification that the gain is robust and attributable to CoF rather than experimental setup.
minor comments (2)
  1. The abstract states the model 'outperforms existing VLMs by 5% among 8 image resolutions' but does not list the precise resolutions or the per-resolution breakdown; adding a table or figure with these values would improve clarity.
  2. Notation for the visual agent and its output format is introduced without a dedicated diagram or pseudocode; a small illustrative example of one CoF trajectory would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We have addressed each of the major comments below, providing clarifications and committing to revisions where necessary to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Dataset Construction): The MM-CoF dataset labels are produced by an external visual agent, yet no ablation is presented that substitutes human region annotations or a deliberately mismatched agent. This omission is load-bearing for the central claim, as the reported 5% V* gain occurs after SFT on this dataset; without such controls it is impossible to separate the contribution of CoF reasoning from any systematic region-selection biases (e.g., saliency heuristics or resolution-dependent cropping) embedded in the agent's policy.

    Authors: We acknowledge the importance of isolating the CoF mechanism from potential biases in the dataset construction process. The visual agent was employed to generate adaptive region labels that simulate human-like focusing across different resolutions and questions, as detailed in §3.2. While a full human-annotated version of the 3K-sample dataset would be ideal for comparison, it is practically challenging due to annotation costs and time. In the revised manuscript, we will add an ablation study using a mismatched agent (e.g., one that selects regions based on simple saliency without task awareness) to better control for biases. Additionally, we will include a discussion on how the subsequent RL stage allows the model to refine strategies beyond the initial agent policy, thereby attributing gains more directly to the CoF reasoning. revision: partial

  2. Referee: [§4] §4 (Experiments): The abstract and results claim a 5% improvement on V* across 224–4K resolutions, but the manuscript supplies no details on the exact baselines compared, statistical significance tests, error bars, or controls for confounding factors such as total training compute or model size. These omissions prevent verification that the gain is robust and attributable to CoF rather than experimental setup.

    Authors: We agree that providing more rigorous experimental details is essential for verifying the reported improvements. In the updated Section 4, we will specify the exact baseline models used (including their versions and training details), report error bars from repeated experiments, include statistical significance testing for the 5% gain on V*, and add controls to ensure fair comparison in terms of model size and total training compute. These additions will help confirm that the performance gains are attributable to the proposed CoF approach. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical two-stage training pipeline (SFT on MM-CoF dataset generated by an external visual agent, followed by RL using outcome accuracies and format rewards) applied to Qwen2.5-VL, with performance gains reported on the independent V* benchmark across resolutions. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations are present that would make the claimed 5% improvement equivalent to the inputs by construction. The central result remains an external empirical observation rather than a tautological renaming or forced outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an empirical training pipeline but introduces no explicit free parameters, mathematical axioms, or new postulated entities; the central claim rests on the effectiveness of the described SFT and RL stages.

pith-pipeline@v0.9.0 · 5587 in / 1168 out tokens · 44422 ms · 2026-05-17T05:31:35.813719+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  2. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  3. LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.

  4. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    cs.CV 2025-12 unverdicted novelty 7.0

    DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

  5. Training Multi-Image Vision Agents via End2End Reinforcement Learning

    cs.CV 2025-12 unverdicted novelty 7.0

    IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing too...

  6. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  7. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...

  8. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  9. Large Vision-Language Models Get Lost in Attention

    cs.AI 2026-05 unverdicted novelty 6.0

    In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

  10. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  11. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  12. Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

    cs.CV 2026-04 unverdicted novelty 6.0

    Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.

  13. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  14. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

  15. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  16. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  17. CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

    cs.CV 2025-11 unverdicted novelty 6.0

    CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.

  18. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  19. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  20. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  21. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 18 Pith papers · 21 internal anchors

  1. [1]

    Tallyqa: Answering complex counting ques- tions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting ques- tions. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8076–8084, 2019

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Song- cen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for de- mocratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

  3. [3]

    Claude 3.7 Sonnet.https://www.anthropic.com/claude/sonnet, 2025

    Anthropic. Claude 3.7 Sonnet.https://www.anthropic.com/claude/sonnet, 2025. Ac- cessed: 2025-05-10

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 29

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    Llama-nemotron: Efficient reasoning models

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025

  7. [7]

    InternVL-3: A Vision-Language Foundation Model for Continued Learning, 2024

    Zhaoyang Chen, Yichi Zhang, Ruijie Quan, Zuchao Li, Geng-Xin Miao, Hai-Tao Zheng, Ziyue Wang, Guansong Lu, Jing Wen, Jia-Qi Lin, Wei-Shi Zheng, Ping Luo, and Wen-Guan Wang. InternVL-3: A Vision-Language Foundation Model for Continued Learning, 2024

  8. [8]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020

  10. [10]

    Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage

    Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song- Chun Zhu, and Qing Li. Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage. InProceedings of the International Conference on Learning Representations, 2025

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodal- ity, Long Context, and Next Generation Agentic Capabilities, 2025

    Gemini Team and Google. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodal- ity, Long Context, and Next Generation Agentic Capabilities, 2025

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shi- rong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  13. [13]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 30

  14. [14]

    Visual program distillation: Distilling tools and programmatic reasoning into vision-language models

    Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9590–9601, 2024

  15. [15]

    Prompting large language model with context and pre-answer for knowledge-based vqa.Pattern Recognition, 151:110399, 2024

    Zhongjian Hu, Peng Yang, Yuanshuang Jiang, and Zijian Bai. Prompting large language model with context and pre-answer for knowledge-based vqa.Pattern Recognition, 151:110399, 2024

  16. [16]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019

  17. [17]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  18. [18]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Hel- yar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  19. [19]

    Language–image consistency augmentation and distillation network for visual grounding.Pattern Recognition, 166:111663, 2025

    Xiao Ke, Peirong Xu, and Wenzhong Guo. Language–image consistency augmentation and distillation network for visual grounding.Pattern Recognition, 166:111663, 2025

  20. [20]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  21. [21]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yan- wei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  22. [22]

    Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding

    Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9098–9108, 2025. 31

  23. [23]

    Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. InProceedings of the International Conference on Machine Learning, pages 19730–19742, 2023

  24. [24]

    Multi- modal arXiv: A dataset for improving scientific comprehension of large vision-language models

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multi- modal arXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, 2024

  25. [25]

    Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

    Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

  26. [26]

    Llava- next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava- next: Improved reasoning, ocr, and world knowledge, January 2024

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2024

  28. [28]

    UniVG-R1: Reasoning guided universal visual grounding with reinforce- ment learning.arXiv preprint arXiv:2506.12151, 2025

    Shiyin Liu, Bo Shi, Ruijie Chen, Jian Shi, Junfeng Li, Jinsong Tang, Liujun Tang, Han Zhang, Zonglin Lu, Ke Sun, and Qi Chen. UniVG-R1: Reasoning guided universal visual grounding with reinforce- ment learning.arXiv preprint arXiv:2506.12151, 2025

  29. [29]

    VisualToolAgent (VisTA): A reinforcement learning framework for visual tool selection.arXiv preprint arXiv:2506.12152, 2025

    Shiyin Liu, Bo Shi, Ruijie Chen, Jian Shi, Junfeng Li, Jinsong Tang, Liujun Tang, Han Zhang, Zonglin Lu, Ke Sun, and Qi Chen. VisualToolAgent (VisTA): A reinforcement learning framework for visual tool selection.arXiv preprint arXiv:2506.12152, 2025

  30. [30]

    Pope: Parallel-object-property-evaluation benchmark for large language models.arXiv preprint arXiv:2209.03058, 2022

    Ziyang Ma, Yibo Song, Tiannan Su, Wenhao Li, Zesong Liu, Yuan Ren, Min Zhou, Shuai Yang, and Rongrong He. Pope: Parallel-object-property-evaluation benchmark for large language models.arXiv preprint arXiv:2209.03058, 2022

  31. [31]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665, 2024. 32

  32. [32]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. Technical report, OpenAI, 2023

  33. [33]

    Openai o3 and o4-mini system card

    OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, 2025

  34. [34]

    Deepeyes: Progressive visual analytics for designing deep neural networks.IEEE transactions on visualization and computer graphics, 24(1):98–108, 2017

    Nicola Pezzotti, Thomas Höllt, Jan Van Gemert, Boudewijn PF Lelieveldt, Elmar Eisemann, and Anna Vilanova. Deepeyes: Progressive visual analytics for designing deep neural networks.IEEE transactions on visualization and computer graphics, 24(1):98–108, 2017

  35. [35]

    GRIT: Teaching MLLMs to think with images.arXiv preprint arXiv:2506.11993, 2025

    Shiquan Qiu, Yixuan Liu, Honggang Yang, Zhaoyang Wu, Guangzhi Sun, Guoli Lv, Ying Jiang, Xiaoyu Li, Siyuan He, Xiang Gao, Yan Lu, Guangzong Li, and Bin Cui. GRIT: Teaching MLLMs to think with images.arXiv preprint arXiv:2506.11993, 2025

  36. [36]

    Tarr, Aviral Kumar, and Katerina Fragkiadaki

    Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning.arXiv preprint arXiv:2505.23678, 2025

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  38. [38]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

  39. [39]

    Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, and Jun Xiao. VGR: Visual grounded reasoning. arXiv preprint arXiv:2506.11991, 2025

  40. [40]

    Qunbo Wang, Jing Liu, and Wenjun Wu. Coordinating explicit and implicit knowledge for knowledge-based vqa. Pattern Recognition, 151:110368, 2024

  41. [41]

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. arXiv preprint arXiv:2408.15556, 2024

  42. [42]

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025

  43. [43]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  44. [44]

    Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025

  45. [45]

    Penghao Wu and Saining Xie. V⋆: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  46. [46]

    Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024

  47. [47]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  48. [48]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  49. [49]

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  50. [50]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations, 2023

  51. [51]

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023

  52. [52]

    Guoqing Zhang, Shichao Kan, Lu Shi, Wanru Xu, Gaoyun An, and Yigang Cen. Cross-scene visual context parsing with large vision-language model. Pattern Recognition, page 111641, 2025

  53. [53]

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025

  54. [54]

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

  55. [55]

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  56. [56]

    Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, and Xianyuan Zhan. Instruction-guided visual masking. In Advances in Neural Information Processing Systems, volume 37, pages 126004–126031, 2024

  57. [57]

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  58. [58]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  59. [59]

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(11):7380–7399, 2021