Recognition: 2 Lean theorem links
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Pith reviewed 2026-05-17 05:31 UTC · model grok-4.3
The pith
VLMs can reason more efficiently by adaptively searching and zooming into key image regions via Chain-of-Focus training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Constructing the MM-CoF dataset with a visual agent that identifies key regions for different resolutions and questions, fine-tuning Qwen2.5-VL on it, and then updating the model with reinforcement learning on accuracy and format rewards produces a system that performs dynamic visual search and zooming and achieves better results on visual reasoning benchmarks.
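As a concrete illustration of the RL stage described above, here is a minimal sketch of an outcome-level reward that combines answer accuracy with a format check; the <think>/<answer> template, exact-match rule, and 0.5 weighting are assumptions made for illustration, not details reported in the paper.

```python
import re

def cof_reward(response: str, ground_truth: str) -> float:
    """Hypothetical sketch of an outcome-based reward: a format term (the
    response follows an expected <think>/<answer> template) plus an accuracy
    term (the extracted answer matches the reference)."""
    # Format reward: exactly one think block followed by one answer block.
    has_format = bool(re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
        response, flags=re.DOTALL))
    format_reward = 1.0 if has_format else 0.0

    # Accuracy reward: exact match on the extracted answer (real setups often
    # use normalized or fuzzy matching instead).
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip().lower() if match else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip().lower() else 0.0

    return accuracy_reward + 0.5 * format_reward  # the weighting is an assumption
```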
What carries the argument
The Chain-of-Focus (CoF) method, which lets the model adaptively identify and zoom into key image regions based on visual cues and the question.
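For intuition, a minimal sketch of what an adaptive search-and-zoom loop of this kind could look like at inference time is below; the `generate_step` interface, action fields, crop size, and step budget are hypothetical, not the paper's actual decoding procedure.

```python
from PIL import Image

def chain_of_focus(model, image: Image.Image, question: str, max_steps: int = 4):
    """Minimal sketch of an adaptive focus loop in the spirit of CoF.
    `model.generate_step` is a hypothetical interface that returns either a
    bounding box to zoom into or a final answer."""
    context = [image]
    for _ in range(max_steps):
        step = model.generate_step(context, question)
        if step.kind == "answer":
            return step.text
        # step.kind == "zoom": crop the proposed region and upsample it so the
        # model sees the key area at a higher effective resolution.
        x0, y0, x1, y1 = step.bbox
        crop = image.crop((x0, y0, x1, y1)).resize((672, 672))
        context.append(crop)
    return model.generate_step(context, question, force_answer=True).text
```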
If this is right
- Performance on the V* benchmark improves by 5 percent across eight image resolutions from 224 to 4K compared with prior VLMs.
- Multimodal reasoning becomes possible without forcing the entire image through high-resolution processing at every step.
- The two-stage pipeline of supervised fine-tuning followed by reinforcement learning refines the model's search strategy without additional human-designed priors.
- Deployment of VLMs in practical settings becomes more efficient because only selected regions need detailed analysis.
Where Pith is reading between the lines
- The same focusing mechanism could be tested on video sequences to see whether frame-by-frame adaptive search reduces compute while preserving temporal reasoning.
- If the learned zoom policy generalizes, it might combine with existing compression techniques to further lower memory use during inference.
- The approach suggests a route for making attention mechanisms in VLMs more like selective human vision rather than uniform grid processing.
Load-bearing premise
The visual agent that generates the training examples consistently picks the right regions without introducing biases that would limit performance on real user questions or new image distributions.
What would settle it
A controlled test in which the model is evaluated on images where the visual agent demonstrably misses the task-critical area and shows a clear drop in accuracy relative to full-image baselines.
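A hedged sketch of how such a control could be scored: split the evaluation set by whether the agent's selected region overlaps the annotated task-critical box, then compare CoF and full-image accuracy on each split. The `iou` helper, the `select_region` and `answer` interfaces, and the 0.5 IoU threshold are assumptions for illustration.

```python
def coverage_split_eval(items, agent, cof_model, baseline_model, iou_thresh=0.5):
    """Partition evaluation items by whether the visual agent's region covers
    the annotated task-critical box, then compare CoF accuracy against a
    full-image baseline on each partition."""
    buckets = {"covered": [], "missed": []}
    for item in items:
        region = agent.select_region(item.image, item.question)
        key = "covered" if iou(region, item.critical_box) >= iou_thresh else "missed"
        cof_ok = cof_model.answer(item.image, item.question) == item.answer
        base_ok = baseline_model.answer(item.image, item.question) == item.answer
        buckets[key].append((cof_ok, base_ok))
    return {k: {"cof_acc": sum(c for c, _ in v) / len(v),
                "base_acc": sum(b for _, b in v) / len(v)}
            for k, v in buckets.items() if v}
```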
Original abstract
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, enabling further refining the search and reasoning strategy of models without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark that requires strong visual reasoning capability, our model outperforms existing VLMs by 5% among 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating the more efficient deployment of VLMs in practical applications.
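For concreteness, one MM-CoF supervision sample might be serialized roughly as follows; the field names, trajectory format, and values are guesses for illustration, since the abstract describes the dataset only at a high level (agent-selected key regions paired with questions at varying resolutions).

```python
# Hypothetical serialization of one MM-CoF supervision sample; the schema and
# all values below are illustrative assumptions, not the paper's actual format.
mm_cof_sample = {
    "image": "images/000123.jpg",
    "resolution": [2048, 1536],
    "question": "What is written on the small sign near the doorway?",
    "focus_trajectory": [
        {"action": "zoom", "bbox": [1210, 640, 1480, 860],
         "rationale": "the doorway region likely contains the sign"},
        {"action": "zoom", "bbox": [1295, 700, 1400, 790],
         "rationale": "zoom further onto the sign text"},
    ],
    "answer": "STAFF ONLY",
}
```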
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Chain-of-Focus (CoF), an adaptive visual search and zooming mechanism for VLMs to enable efficient multimodal reasoning. It introduces a two-stage training pipeline: supervised fine-tuning (SFT) on the 3K-sample MM-CoF dataset generated by an external visual agent that identifies task-relevant regions across varying resolutions and questions, followed by reinforcement learning (RL) using outcome accuracy and format rewards to refine the Qwen2.5-VL base model. The resulting model is reported to outperform existing VLMs by 5% on the V* benchmark across eight image resolutions ranging from 224 to 4K.
Significance. If the performance gains can be attributed specifically to the CoF mechanism rather than dataset construction artifacts, the approach could support more compute-efficient VLM inference on high-resolution inputs by dynamically focusing computation on relevant regions. The combination of SFT for cold-start initialization and RL for strategy refinement follows established patterns in reasoning model training and may generalize to other visual grounding tasks.
Major comments (2)
- §3.2 (Dataset Construction): The MM-CoF dataset labels are produced by an external visual agent, yet no ablation is presented that substitutes human region annotations or a deliberately mismatched agent. This omission is load-bearing for the central claim, as the reported 5% V* gain occurs after SFT on this dataset; without such controls it is impossible to separate the contribution of CoF reasoning from any systematic region-selection biases (e.g., saliency heuristics or resolution-dependent cropping) embedded in the agent's policy.
- §4 (Experiments): The abstract and results claim a 5% improvement on V* across 224–4K resolutions, but the manuscript supplies no details on the exact baselines compared, statistical significance tests, error bars, or controls for confounding factors such as total training compute or model size. These omissions prevent verification that the gain is robust and attributable to CoF rather than experimental setup.
Minor comments (2)
- The abstract states the model 'outperforms existing VLMs by 5% among 8 image resolutions' but does not list the precise resolutions or the per-resolution breakdown; adding a table or figure with these values would improve clarity.
- Notation for the visual agent and its output format is introduced without a dedicated diagram or pseudocode; a small illustrative example of one CoF trajectory would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our work. We have addressed each of the major comments below, providing clarifications and committing to revisions where necessary to strengthen the manuscript.
Point-by-point responses
- Referee: §3.2 (Dataset Construction): The MM-CoF dataset labels are produced by an external visual agent, yet no ablation is presented that substitutes human region annotations or a deliberately mismatched agent. This omission is load-bearing for the central claim, as the reported 5% V* gain occurs after SFT on this dataset; without such controls it is impossible to separate the contribution of CoF reasoning from any systematic region-selection biases (e.g., saliency heuristics or resolution-dependent cropping) embedded in the agent's policy.
  Authors: We acknowledge the importance of isolating the CoF mechanism from potential biases in the dataset construction process. The visual agent was employed to generate adaptive region labels that simulate human-like focusing across different resolutions and questions, as detailed in §3.2. While a full human-annotated version of the 3K-sample dataset would be ideal for comparison, it is practically challenging due to annotation costs and time. In the revised manuscript, we will add an ablation study using a mismatched agent (e.g., one that selects regions based on simple saliency without task awareness) to better control for biases. Additionally, we will include a discussion on how the subsequent RL stage allows the model to refine strategies beyond the initial agent policy, thereby attributing gains more directly to the CoF reasoning.
  Revision: partial
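A minimal sketch of the kind of "mismatched agent" baseline the authors propose, selecting a region purely by low-level saliency with no access to the question; the gradient-magnitude saliency and coarse grid search here are illustrative stand-ins rather than the planned ablation code.

```python
import numpy as np
from PIL import Image

def saliency_region(image: Image.Image, box_frac: float = 0.3):
    """Task-agnostic control agent: pick the most salient fixed-size box,
    ignoring the question. Saliency is approximated by local gradient
    magnitude; this is an assumed stand-in, not the authors' implementation."""
    gray = np.asarray(image.convert("L"), dtype=np.float32)
    gy, gx = np.gradient(gray)
    saliency = np.abs(gx) + np.abs(gy)
    h, w = saliency.shape
    bh, bw = int(h * box_frac), int(w * box_frac)
    # Slide a coarse grid of candidate boxes and keep the most salient one.
    best, best_score = (0, 0, bw, bh), -1.0
    for y0 in range(0, h - bh + 1, max(1, bh // 2)):
        for x0 in range(0, w - bw + 1, max(1, bw // 2)):
            score = saliency[y0:y0 + bh, x0:x0 + bw].mean()
            if score > best_score:
                best, best_score = (x0, y0, x0 + bw, y0 + bh), score
    return best  # (x0, y0, x1, y1) in pixel coordinates
```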
- Referee: §4 (Experiments): The abstract and results claim a 5% improvement on V* across 224–4K resolutions, but the manuscript supplies no details on the exact baselines compared, statistical significance tests, error bars, or controls for confounding factors such as total training compute or model size. These omissions prevent verification that the gain is robust and attributable to CoF rather than experimental setup.
  Authors: We agree that providing more rigorous experimental details is essential for verifying the reported improvements. In the updated Section 4, we will specify the exact baseline models used (including their versions and training details), report error bars from repeated experiments, include statistical significance testing for the 5% gain on V*, and add controls to ensure fair comparison in terms of model size and total training compute. These additions will help confirm that the performance gains are attributable to the proposed CoF approach.
  Revision: yes
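One standard way to carry out the promised significance testing is a paired bootstrap over per-example correctness for the CoF model and a baseline, sketched below; this is a generic recipe, not the authors' stated protocol.

```python
import numpy as np

def paired_bootstrap_pvalue(cof_correct, base_correct, n_resamples=10_000, seed=0):
    """Paired bootstrap over per-example correctness: estimates how often the
    accuracy gap (e.g. the reported ~5% on V*) fails to favor CoF when the
    evaluation set is resampled with replacement."""
    cof = np.asarray(cof_correct, dtype=float)
    base = np.asarray(base_correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(cof)
    observed = cof.mean() - base.mean()
    losses = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)        # resample examples with replacement
        if cof[idx].mean() - base[idx].mean() <= 0:
            losses += 1                          # resampled gap does not favor CoF
    return observed, losses / n_resamples        # gap and one-sided bootstrap p-value
```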
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an empirical two-stage training pipeline (SFT on MM-CoF dataset generated by an external visual agent, followed by RL using outcome accuracies and format rewards) applied to Qwen2.5-VL, with performance gains reported on the independent V* benchmark across resolutions. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citations are present that would make the claimed 5% improvement equivalent to the inputs by construction. The central result remains an external empirical observation rather than a tautological renaming or forced outcome.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (tag: unclear)
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "two-stage training pipeline... supervised fine-tuning on MM-Adaptive-CoF SFT dataset... reinforcement learning with adaptive group-aware reward (AGAR)"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean : J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "reduces zoom-in operations by 75%... nearly 50% fewer tokens"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
  UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
  GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
- LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
  LAGO achieves state-of-the-art zero-shot performance with fewer image regions by using class-agnostic object discovery followed by confidence-controlled language-guided refinement and dual-channel aggregation.
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
  DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
- Training Multi-Image Vision Agents via End2End Reinforcement Learning
  IMAgent trains a multi-image vision agent via pure end-to-end RL with visual reflection tools and a two-layer motion trajectory masking strategy, reaching SOTA on single- and multi-image benchmarks while revealing too...
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
- Beyond Thinking: Imagining in 360° for Humanoid Visual Search
  Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
- Large Vision-Language Models Get Lost in Attention
  In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
  Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
- Visual Reasoning through Tool-supervised Reinforcement Learning
  ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
- Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
  Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
- ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
  ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
  Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
- CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
  CropVLM uses reinforcement learning to learn image zooming policies that boost fine-grained perception in VLMs on out-of-domain high-resolution tasks without labeled boxes, synthetic data, or VLM changes.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
  MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...