pith. machine review for the scientific record.

arxiv: 2505.15436 · v3 · submitted 2025-05-21 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 05:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · chain-of-focus · multimodal reasoning · visual search · adaptive zooming · reinforcement learning · efficient inference

The pith

VLMs can reason more efficiently by adaptively searching and zooming into key image regions via Chain-of-Focus training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that vision-language models gain stronger multimodal reasoning by learning to focus and zoom on task-relevant image patches rather than processing entire high-resolution inputs uniformly. It does so through a two-stage process that first uses a visual agent to build a dataset of adaptive focus examples for supervised fine-tuning, then applies reinforcement learning to refine search and reasoning strategies using outcome rewards. A sympathetic reader would care because this approach could maintain or improve accuracy on visual tasks while reducing the computational cost of handling images at resolutions up to 4K.

Core claim

By constructing the MM-CoF dataset from a visual agent that identifies key regions for different resolutions and questions, fine-tuning Qwen2.5-VL on it, and then updating the model with reinforcement learning on accuracy and format rewards, the resulting system performs dynamic visual search and zooming that yields better results on visual reasoning benchmarks.
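The accuracy and format rewards described above can be sketched as a single outcome-reward function. This is a minimal illustration, not the paper's implementation: the tag format, the match criterion, and the 0.5 weighting are assumptions.

```python
import re

def outcome_reward(response: str, ground_truth: str) -> float:
    """Illustrative outcome reward combining a format check and an
    accuracy check; the paper's exact scheme may differ."""
    # Format reward: response should wrap reasoning and answer in tags.
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            response, re.DOTALL))
    format_reward = 0.5 if fmt_ok else 0.0

    # Accuracy reward: exact match of the extracted final answer.
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = m.group(1).strip() if m else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0

    return accuracy_reward + format_reward
```

A malformed response earns neither reward, which is what pushes the RL stage to keep the search-and-answer structure while optimizing for correctness.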

What carries the argument

The Chain-of-Focus (CoF) method, which lets the model adaptively identify and zoom into key image regions based on visual cues and the question.
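The adaptive focus-and-zoom behavior can be pictured as a short inference loop: at each step the model either answers or requests a crop of a key region, and only that region is re-encoded at the next step. A hedged sketch follows; the interface, box format, and step budget are illustrative assumptions, not the paper's API.

```python
def chain_of_focus(model, image, question, max_steps=3):
    """Sketch of a CoF-style loop: `model` returns either
    {"answer": ...} or {"zoom": (x0, y0, x1, y1)}; `image`
    only needs a PIL-like crop() method. All names are assumed."""
    view = image
    for _ in range(max_steps):
        out = model(view, question)
        if "answer" in out:
            return out["answer"]
        x0, y0, x1, y1 = out["zoom"]
        # Zoom: re-encode only the selected region at the next step.
        view = view.crop((x0, y0, x1, y1))
    # Step budget exhausted: force a final answer on the current view.
    return model(view, question).get("answer")
```

The efficiency claim hinges on this loop: the full image is encoded once at low cost, and high-resolution processing is spent only on the regions the model chooses to inspect.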

If this is right

  • Performance on the V* benchmark improves by 5 percent across eight image resolutions from 224 to 4K compared with prior VLMs.
  • Multimodal reasoning becomes possible without forcing the entire image through high-resolution processing at every step.
  • The two-stage pipeline of supervised fine-tuning followed by reinforcement learning refines the model's search strategy without additional human-designed priors.
  • Deployment of VLMs in practical settings becomes more efficient because only selected regions need detailed analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same focusing mechanism could be tested on video sequences to see whether frame-by-frame adaptive search reduces compute while preserving temporal reasoning.
  • If the learned zoom policy generalizes, it might combine with existing compression techniques to further lower memory use during inference.
  • The approach suggests a route for making attention mechanisms in VLMs more like selective human vision rather than uniform grid processing.

Load-bearing premise

The visual agent that generates the training examples consistently picks the right regions without introducing biases that would limit performance on real user questions or new image distributions.

What would settle it

A controlled test in which the model is evaluated on images where the visual agent demonstrably misses the task-critical area and shows a clear drop in accuracy relative to full-image baselines.
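The controlled test proposed above amounts to a simple stratified comparison. A sketch of the bookkeeping, assuming hypothetical per-item record fields (none of these names come from the paper):

```python
def focus_failure_probe(records):
    """Compare CoF vs full-image accuracy on items where the region
    proposer missed the task-critical area. Record fields
    (`agent_box_hits_target`, `cof_correct`, `full_image_correct`)
    are hypothetical labels from a controlled evaluation."""
    missed = [r for r in records if not r["agent_box_hits_target"]]
    if not missed:
        return None
    cof_acc = sum(r["cof_correct"] for r in missed) / len(missed)
    full_acc = sum(r["full_image_correct"] for r in missed) / len(missed)
    return {"n": len(missed), "cof_acc": cof_acc,
            "full_acc": full_acc, "gap": full_acc - cof_acc}
```

A large positive `gap` on the missed subset would confirm the load-bearing premise is doing real work; a near-zero gap would suggest the model recovers even when its initial focus is wrong.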

read the original abstract

Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, enabling further refining the search and reasoning strategy of models without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% among 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Chain-of-Focus (CoF), an adaptive visual search and zooming mechanism for VLMs to enable efficient multimodal reasoning. It introduces a two-stage training pipeline: supervised fine-tuning (SFT) on the 3K-sample MM-CoF dataset generated by an external visual agent that identifies task-relevant regions across varying resolutions and questions, followed by reinforcement learning (RL) using outcome accuracy and format rewards to refine the Qwen2.5-VL base model. The resulting model is reported to outperform existing VLMs by 5% on the V* benchmark across eight image resolutions ranging from 224 to 4K.

Significance. If the performance gains can be attributed specifically to the CoF mechanism rather than dataset construction artifacts, the approach could support more compute-efficient VLM inference on high-resolution inputs by dynamically focusing computation on relevant regions. The combination of SFT for cold-start initialization and RL for strategy refinement follows established patterns in reasoning model training and may generalize to other visual grounding tasks.

major comments (2)
  1. [§3.2] §3.2 (Dataset Construction): The MM-CoF dataset labels are produced by an external visual agent, yet no ablation is presented that substitutes human region annotations or a deliberately mismatched agent. This omission is load-bearing for the central claim, as the reported 5% V* gain occurs after SFT on this dataset; without such controls it is impossible to separate the contribution of CoF reasoning from any systematic region-selection biases (e.g., saliency heuristics or resolution-dependent cropping) embedded in the agent's policy.
  2. [§4] §4 (Experiments): The abstract and results claim a 5% improvement on V* across 224–4K resolutions, but the manuscript supplies no details on the exact baselines compared, statistical significance tests, error bars, or controls for confounding factors such as total training compute or model size. These omissions prevent verification that the gain is robust and attributable to CoF rather than experimental setup.
minor comments (2)
  1. The abstract states the model 'outperforms existing VLMs by 5% among 8 image resolutions' but does not list the precise resolutions or the per-resolution breakdown; adding a table or figure with these values would improve clarity.
  2. Notation for the visual agent and its output format is introduced without a dedicated diagram or pseudocode; a small illustrative example of one CoF trajectory would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We have addressed each of the major comments below, providing clarifications and committing to revisions where necessary to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Dataset Construction): The MM-CoF dataset labels are produced by an external visual agent, yet no ablation is presented that substitutes human region annotations or a deliberately mismatched agent. This omission is load-bearing for the central claim, as the reported 5% V* gain occurs after SFT on this dataset; without such controls it is impossible to separate the contribution of CoF reasoning from any systematic region-selection biases (e.g., saliency heuristics or resolution-dependent cropping) embedded in the agent's policy.

    Authors: We acknowledge the importance of isolating the CoF mechanism from potential biases in the dataset construction process. The visual agent was employed to generate adaptive region labels that simulate human-like focusing across different resolutions and questions, as detailed in §3.2. While a full human-annotated version of the 3K-sample dataset would be ideal for comparison, it is practically challenging due to annotation costs and time. In the revised manuscript, we will add an ablation study using a mismatched agent (e.g., one that selects regions based on simple saliency without task awareness) to better control for biases. Additionally, we will include a discussion on how the subsequent RL stage allows the model to refine strategies beyond the initial agent policy, thereby attributing gains more directly to the CoF reasoning. revision: partial

  2. Referee: [§4] §4 (Experiments): The abstract and results claim a 5% improvement on V* across 224–4K resolutions, but the manuscript supplies no details on the exact baselines compared, statistical significance tests, error bars, or controls for confounding factors such as total training compute or model size. These omissions prevent verification that the gain is robust and attributable to CoF rather than experimental setup.

    Authors: We agree that providing more rigorous experimental details is essential for verifying the reported improvements. In the updated Section 4, we will specify the exact baseline models used (including their versions and training details), report error bars from repeated experiments, include statistical significance testing for the 5% gain on V*, and add controls to ensure fair comparison in terms of model size and total training compute. These additions will help confirm that the performance gains are attributable to the proposed CoF approach. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical two-stage training pipeline (SFT on MM-CoF dataset generated by an external visual agent, followed by RL using outcome accuracies and format rewards) applied to Qwen2.5-VL, with performance gains reported on the independent V* benchmark across resolutions. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations are present that would make the claimed 5% improvement equivalent to the inputs by construction. The central result remains an external empirical observation rather than a tautological renaming or forced outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an empirical training pipeline but introduces no explicit free parameters, mathematical axioms, or new postulated entities; the central claim rests on the effectiveness of the described SFT and RL stages.

pith-pipeline@v0.9.0 · 5587 in / 1168 out tokens · 44422 ms · 2026-05-17T05:31:35.813719+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  2. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  3. LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

    cs.CV 2026-05 unverdicted novelty 7.0

    LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.

  4. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    cs.CV 2025-12 unverdicted novelty 7.0

    DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.

  5. Training Multi-Image Vision Agents via End2End Reinforcement Learning

    cs.CV 2025-12 unverdicted novelty 7.0

    IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing too...

  6. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  7. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...

  8. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  9. Large Vision-Language Models Get Lost in Attention

    cs.AI 2026-05 unverdicted novelty 6.0

    In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

  10. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  11. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  12. Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

    cs.CV 2026-04 unverdicted novelty 6.0

    Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.

  13. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  14. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

  15. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  16. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  17. CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

    cs.CV 2025-11 unverdicted novelty 6.0

    CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.

  18. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  19. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  20. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  21. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 18 Pith papers · 21 internal anchors

  1. [1]

    Tallyqa: Answering complex counting ques- tions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting ques- tions. InProceedings of the AAAI Conference on Artificial Intelligence, pages 8076–8084, 2019

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Song- cen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for de- mocratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

  3. [3]

    Claude 3.7 Sonnet.https://www.anthropic.com/claude/sonnet, 2025

    Anthropic. Claude 3.7 Sonnet.https://www.anthropic.com/claude/sonnet, 2025. Ac- cessed: 2025-05-10

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 29

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    Llama-nemotron: Efficient reasoning models

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025

  7. [7]

    InternVL-3: A Vision-Language Foundation Model for Continued Learning, 2024

    Zhaoyang Chen, Yichi Zhang, Ruijie Quan, Zuchao Li, Geng-Xin Miao, Hai-Tao Zheng, Ziyue Wang, Guansong Lu, Jing Wen, Jia-Qi Lin, Wei-Shi Zheng, Ping Luo, and Wen-Guan Wang. InternVL-3: A Vision-Language Foundation Model for Continued Learning, 2024

  8. [8]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2020

  10. [10]

    Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage

    Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song- Chun Zhu, and Qing Li. Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage. InProceedings of the International Conference on Learning Representations, 2025

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodal- ity, Long Context, and Next Generation Agentic Capabilities, 2025

    Gemini Team and Google. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodal- ity, Long Context, and Next Generation Agentic Capabilities, 2025

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shi- rong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  13. [13]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 30

  14. [14]

    Visual program distillation: Distilling tools and programmatic reasoning into vision-language models

    Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9590–9601, 2024

  15. [15]

    Prompting large language model with context and pre-answer for knowledge-based vqa.Pattern Recognition, 151:110399, 2024

    Zhongjian Hu, Peng Yang, Yuanshuang Jiang, and Zijian Bai. Prompting large language model with context and pre-answer for knowledge-based vqa.Pattern Recognition, 151:110399, 2024

  16. [16]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019

  17. [17]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Os- trow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  18. [18]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Hel- yar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  19. [19]

    Language–image consistency augmentation and distillation network for visual grounding.Pattern Recognition, 166:111663, 2025

    Xiao Ke, Peirong Xu, and Wenzhong Guo. Language–image consistency augmentation and distillation network for visual grounding.Pattern Recognition, 166:111663, 2025

  20. [20]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

  21. [21]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yan- wei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  22. [22]

    Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding

    Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9098–9108, 2025. 31

  23. [23]

    Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. InProceedings of the International Conference on Machine Learning, pages 19730–19742, 2023

  24. [24]

    Multi- modal arXiv: A dataset for improving scientific comprehension of large vision-language models

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multi- modal arXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, 2024

  25. [25]

    Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

    Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

  26. [26]

    Llava- next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava- next: Improved reasoning, ocr, and world knowledge, January 2024

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2024

  28. [28]

    UniVG-R1: Reasoning guided universal visual grounding with reinforce- ment learning.arXiv preprint arXiv:2506.12151, 2025

    Shiyin Liu, Bo Shi, Ruijie Chen, Jian Shi, Junfeng Li, Jinsong Tang, Liujun Tang, Han Zhang, Zonglin Lu, Ke Sun, and Qi Chen. UniVG-R1: Reasoning guided universal visual grounding with reinforce- ment learning.arXiv preprint arXiv:2506.12151, 2025

  29. [29]

    VisualToolAgent (VisTA): A reinforcement learning framework for visual tool selection.arXiv preprint arXiv:2506.12152, 2025

    Shiyin Liu, Bo Shi, Ruijie Chen, Jian Shi, Junfeng Li, Jinsong Tang, Liujun Tang, Han Zhang, Zonglin Lu, Ke Sun, and Qi Chen. VisualToolAgent (VisTA): A reinforcement learning framework for visual tool selection.arXiv preprint arXiv:2506.12152, 2025

  30. [30]

    Pope: Parallel-object-property-evaluation benchmark for large language models.arXiv preprint arXiv:2209.03058, 2022

    Ziyang Ma, Yibo Song, Tiannan Su, Wenhao Li, Zesong Liu, Yuan Ren, Min Zhou, Shuai Yang, and Rongrong He. Pope: Parallel-object-property-evaluation benchmark for large language models.arXiv preprint arXiv:2209.03058, 2022

  31. [31]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data.arXiv preprint arXiv:2406.18665, 2024. 32

  32. [32]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. Technical report, OpenAI, 2023

  33. [33]

    Openai o3 and o4-mini system card

    OpenAI. Openai o3 and o4-mini system card. Technical report, OpenAI, 2025

  34. [34]

    Deepeyes: Progressive visual analytics for designing deep neural networks.IEEE transactions on visualization and computer graphics, 24(1):98–108, 2017

    Nicola Pezzotti, Thomas Höllt, Jan Van Gemert, Boudewijn PF Lelieveldt, Elmar Eisemann, and Anna Vilanova. Deepeyes: Progressive visual analytics for designing deep neural networks.IEEE transactions on visualization and computer graphics, 24(1):98–108, 2017

  35. [35]

    GRIT: Teaching MLLMs to think with images.arXiv preprint arXiv:2506.11993, 2025

    Shiquan Qiu, Yixuan Liu, Honggang Yang, Zhaoyang Wu, Guangzhi Sun, Guoli Lv, Ying Jiang, Xiaoyu Li, Siyuan He, Xiang Gao, Yan Lu, Guangzong Li, and Bin Cui. GRIT: Teaching MLLMs to think with images.arXiv preprint arXiv:2506.11993, 2025

  36. [36]

    Tarr, Aviral Kumar, and Katerina Fragkiadaki

    Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning.arXiv preprint arXiv:2505.23678, 2025

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  38. [38]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

  39. [39]

    Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, and Jun Xiao. VGR: Visual grounded reasoning. arXiv preprint arXiv:2506.11991, 2025

  40. [40]

    Qunbo Wang, Jing Liu, and Wenjun Wu. Coordinating explicit and implicit knowledge for knowledge-based vqa. Pattern Recognition, 151:110368, 2024

  41. [41]

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. arXiv preprint arXiv:2408.15556, 2024

  42. [42]

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934, 2025

  43. [43]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  44. [44]

    Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. Spatialscore: Towards unified evaluation for multimodal spatial understanding. arXiv preprint arXiv:2505.17012, 2025

  45. [45]

    Penghao Wu and Saining Xie. V⋆: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  46. [46]

    Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024

  47. [47]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  48. [48]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  49. [49]

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025

  50. [50]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations, 2023

  51. [51]

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023

  52. [52]

    Guoqing Zhang, Shichao Kan, Lu Shi, Wanru Xu, Gaoyun An, and Yigang Cen. Cross-scene visual context parsing with large vision-language model. Pattern Recognition, page 111641, 2025

  53. [53]

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025

  54. [54]

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

  55. [55]

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  56. [56]

    Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, and Xianyuan Zhan. Instruction-guided visual masking. In Advances in Neural Information Processing Systems, volume 37, pages 126004–126031, 2024

  57. [57]

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  58. [58]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  59. [59]

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(11):7380–7399, 2021