arxiv: 2605.18740 · v2 · pith:AM5CTMEE · submitted 2026-05-18 · cs.CV · cs.AI · cs.CL · cs.LG

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Pith reviewed 2026-05-20 10:53 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.LG
keywords Vision-OPD · self-distillation · multimodal LLMs · fine-grained visual understanding · on-policy learning · regional-to-global perception gap · image crops

The pith

Vision-OPD lets MLLMs internalize fine-grained visual focus by self-distilling from their own evidence-centered crops to full images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a regional-to-global perception gap in multimodal large language models, where the same model answers fine-grained questions more accurately from focused image crops than from full images. To address this, it introduces Vision-OPD, a self-distillation method that uses the crop-conditioned version of the model as a teacher to guide the full-image version through on-policy rollouts, reducing differences in their token predictions. This approach allows the model to learn better attention to relevant details without any external models, labels, or additional tools during inference. Experiments demonstrate that models trained this way perform competitively against much larger systems on fine-grained visual benchmarks.
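The regional-to-global gap described above is a difference of two accuracies measured on the same questions. A minimal sketch of that bookkeeping follows; the function name and the toy correctness lists are hypothetical and are not taken from the paper.

```python
def regional_to_global_gap(crop_correct, full_correct):
    """Return (crop accuracy, full-image accuracy, gap) in percentage points.

    Both arguments are per-question booleans for the SAME questions:
    one list for crop-conditioned answers, one for full-image answers.
    """
    if len(crop_correct) != len(full_correct) or not crop_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(crop_correct)
    crop_acc = 100.0 * sum(crop_correct) / n
    full_acc = 100.0 * sum(full_correct) / n
    return crop_acc, full_acc, crop_acc - full_acc


# Toy, made-up outcomes for five questions (True = answered correctly).
crop_correct = [True, True, True, False, True]
full_correct = [True, False, True, False, False]

crop_acc, full_acc, gap = regional_to_global_gap(crop_correct, full_correct)
print(f"crop={crop_acc:.1f}%  full={full_acc:.1f}%  gap={gap:+.1f} points")
```

A positive gap means the model does better when given the crop, which is the pattern the paper reports.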

Core claim

Vision-OPD transfers the privileged perception of a crop-conditioned teacher policy to a full-image student policy by minimizing the token-level divergence between their next-token distributions along the student's on-policy rollouts, so that the MLLM internalizes the benefits of visual zooming.
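One way to write this objective, as a sketch rather than the paper's exact equation: the question symbol q, the parameter symbol θ, and the unweighted sum over rollout positions are notational assumptions. The divergence D is left generic, and whether the teacher term is treated as a fixed target is not stated on this page.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{y \sim p_S(\cdot \mid x_{\text{global}},\, q)}
    \left[ \sum_{t=1}^{|y|}
      D\!\left( p_T(\cdot \mid x_{\text{crop}},\, q,\, y_{<t})
        \,\middle\|\,
        p_S(\cdot \mid x_{\text{global}},\, q,\, y_{<t}) \right)
    \right]
```

Here p_T and p_S are the same MLLM conditioned on the crop and on the full image, respectively, and y is a rollout sampled from the student.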

What carries the argument

On-policy self-distillation from a crop-conditioned teacher to a full-image student within the same MLLM, minimizing divergence on generated rollouts to close the regional-to-global perception gap.
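The per-token divergence can be illustrated on toy numbers. The sketch below assumes D is the KL divergence from teacher to student, which is one common choice and not necessarily the one the paper uses; the logits are made up, and a real implementation would operate on model outputs over the full vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def rollout_divergence(teacher_logits, student_logits):
    """Mean and per-position KL(teacher || student) along one student rollout.

    Each argument is a list with one logit vector per rollout position.
    Assumption: both vectors at a position score the same next token,
    given the same student-generated prefix.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Toy logits over a 3-token vocabulary at two rollout positions.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # crop-conditioned
student_logits = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # full-image-conditioned

mean_kl, per_token = rollout_divergence(teacher_logits, student_logits)
print(per_token, mean_kl)
```

The first position contributes a positive divergence because the teacher is more confident than the student; the second contributes zero because the two distributions agree.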

If this is right

  • The trained model performs better on fine-grained visual tasks using only full images.
  • It eliminates the need for external zooming or cropping tools at inference time.
  • Performance reaches levels competitive with larger or agentic models.
  • The method works without ground-truth labels or reward models.
  • Regional perception advantages can be internalized into global processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could lead to more efficient vision-language models that do not require high-resolution processing for all tasks.
  • Similar self-distillation might apply to other sensory modalities or perception challenges in AI.
  • Exploring variations in how crops are selected could further optimize the transfer process.

Load-bearing premise

The performance advantage on evidence-centered crops over full images stems from a focus problem that can be transferred via next-token distribution matching rather than from inherent differences in recognition capability.

What would settle it

Running the Vision-OPD training on a model and observing no gain or a loss in accuracy on fine-grained visual understanding benchmarks compared to the original model would falsify the effectiveness of the distillation approach.

Figures

Figures reproduced from arXiv: 2605.18740 by Hongyu Lin, Jie Lou, Le Sun, Qianhao Yuan, Xianpei Han, Xing Yu, Yaojie Lu.

Figure 1: Average scores across fine-grained visual understanding benchmarks, including V* Bench, …
Figure 2: A case of the regional-to-global gap, based on Qwen3.5-9B. The global image input leads to the wrong answer, while the cropped region input yields the correct answer.
[Adjacent plot, caption not captured: global vs. regional accuracy (%) for Qwen3.5-4B, Qwen3.5-9B, GLM-4.6V, GPT-5.4, and Gemini-3.1-Pro, with annotated gaps of +21.7, +19.5, +22.1, +19.3, and +18.1.]
Figure 4: Overview of Vision-OPD. Left: Fine-grained visual questions are generated on evidence-centered crops and grounded back to the full image via bounding-box overlay. Right: A teacher policy p_T(· | x_crop) and a student policy p_S(· | x_global) are instantiated from the same MLLM. The student generates on-policy rollouts y ∼ p_S, and the per-token divergence D(p_T ∥ p_S) along these rollouts provides dense supervisi…
Figure 5: Regional-to-global gap during Vision-OPD training. A lower gap indicates that the model can better recover crop-visible evidence from the full image.
[Adjacent body text: "… full image. To test whether Vision-OPD addresses this bottleneck during training, we use the same comparison as in Section 3.1: each checkpoint answers the same question with the full image as input and with the evidence-centered crop as input. We track the r…"]
Figure 6: Inference speed comparison. Vision-OPD-9B achieves faster inference than agentic …
Original abstract

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs exhibit a regional-to-global perception gap, answering fine-grained questions more accurately on evidence-centered crops than full images. It proposes Vision-OPD, an on-policy self-distillation method that trains a full-image student policy to match the next-token distributions of a crop-conditioned teacher policy (instantiated from the same MLLM) along student-generated rollouts, thereby internalizing zooming benefits without external teachers, labels, verifiers, or inference-time tools. Experiments reportedly show competitive or superior results on fine-grained visual benchmarks versus larger models and agentic baselines.

Significance. If the regional-to-global gap holds and the distillation transfers it without implicit supervision in crop construction, the result would be significant: it offers a label-free, model-internal route to improve detail-oriented multimodal reasoning, potentially reducing reliance on scale or external agents while remaining compatible with existing MLLM training pipelines.

major comments (2)
  1. §3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.
  2. §4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.
minor comments (2)
  1. Notation for the token-level divergence loss (Eq. 3 or equivalent) should explicitly state whether KL is computed only on student-generated tokens or includes teacher-forced tokens.
  2. Figure 4 (method overview) would benefit from an explicit arrow or label showing the on-policy rollout path from student to teacher comparison.
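The first minor comment turns on which positions enter the loss average. The sketch below contrasts the two readings using a 0/1 loss mask; the per-token loss values, the prompt length, and the function name are all hypothetical, and nothing here indicates which variant the paper uses.

```python
def masked_mean(per_token_loss, loss_mask):
    """Average per-token losses over positions where loss_mask is 1."""
    if len(per_token_loss) != len(loss_mask):
        raise ValueError("loss and mask must have the same length")
    kept = [loss for loss, m in zip(per_token_loss, loss_mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    return sum(kept) / len(kept)


# Made-up per-token divergences for a sequence of 3 prompt (teacher-forced)
# tokens followed by 4 student-generated tokens.
per_token_loss = [0.9, 0.8, 0.7, 0.2, 0.4, 0.1, 0.3]
prompt_len = 3

# Reading 1: only student-generated positions contribute.
student_only_mask = [0] * prompt_len + [1] * (len(per_token_loss) - prompt_len)
# Reading 2: every position contributes, including teacher-forced ones.
all_positions_mask = [1] * len(per_token_loss)

student_only = masked_mean(per_token_loss, student_only_mask)
all_positions = masked_mean(per_token_loss, all_positions_mask)
print(f"student-only={student_only:.3f}  all-positions={all_positions:.3f}")
```

The two averages differ whenever the divergence on prompt positions differs from that on generated positions, which is why the referee asks for the loss notation to be explicit.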

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: [—] §3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.

    Authors: We agree that full transparency on crop construction is essential. In the revised manuscript we will expand §3.1 with a complete algorithmic description and pseudocode of the evidence-centered crop procedure. The selection operates without access to ground-truth answers, without post-hoc verification against the answer, and without any mechanism that injects the fine-grained supervisory signal into the crop itself. This preserves the claim that the observed regional-to-global gap is emergent from the MLLM’s own perception rather than from privileged crop construction. We will also add an explicit statement confirming the absence of external labels or verifiers at crop-generation time. revision: yes

  2. Referee: [—] §4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.

    Authors: We acknowledge that isolating the on-policy component and providing statistical context would strengthen attribution. In the revision we will add a new ablation in §4 that directly compares (i) the full Vision-OPD pipeline against (ii) a simple crop-augmentation baseline that feeds crops to the student without on-policy rollouts or distillation. We will also report mean performance and standard deviation over three independent training runs with different random seeds, together with error bars on the main benchmark tables. These results will appear in the main paper and supplementary material. revision: yes
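The promised reporting of mean and standard deviation over three seeds could be computed as sketched below. The configuration labels and the scores are placeholders invented for illustration; they are not results from the paper or from the promised ablation.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder accuracies (%) for three seeds per configuration.
# These numbers are made up and carry no empirical meaning.
runs = {
    "config_i": [70.0, 71.0, 72.0],
    "config_ii": [66.0, 68.0, 67.0],
}

summary = {name: summarize_runs(scores) for name, scores in runs.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.1f} ± {std:.1f}")
```

With only three runs the sample standard deviation is itself a rough estimate, so it is best read as an indication of spread rather than a significance test.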

Circularity Check

0 steps flagged

No significant circularity detected in Vision-OPD derivation chain

The paper's derivation starts from an empirical observation of a regional-to-global perception gap (same MLLM performs better on evidence-centered crops than full images) and proceeds to a self-distillation procedure that instantiates crop-conditioned and full-image policies from the identical base MLLM, then minimizes token-level divergence along the student's on-policy rollouts. This chain does not reduce any claimed result to its inputs by construction: the crop advantage is presented as an independent, testable fact rather than a definitional premise, the distillation objective is a standard on-policy KL-style transfer that does not presuppose the final performance gain, and no self-citation or uniqueness theorem is invoked to force the method. The approach remains self-contained against external benchmarks because the training signal derives from differential conditioning on the same model rather than from fitted parameters renamed as predictions or from externally privileged labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical observation of a crop advantage and the standard assumption that aligning next-token distributions improves policy behavior; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: The crop-conditioned version of the MLLM produces superior next-token distributions for fine-grained questions relative to the full-image version.
    This observation is invoked to justify why distilling from the crop policy should improve the full-image policy.

pith-pipeline@v0.9.0 · 5789 in / 1214 out tokens · 48477 ms · 2026-05-20T10:53:32.140952+00:00 · methodology

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 22 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  2. [2]

    Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  3. [3]

    Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

    Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

  4. [4]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

  5. [5]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  6. [6]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  7. [7]

    Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452, 2025

    Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452, 2025

  8. [8]

    Gemini 3

    Google. Gemini 3. https://blog.google/products-and-platforms/products/ gemini/gemini-3/, 2025

  9. [9]

    Gemini 3.1 pro

    Google. Gemini 3.1 pro. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

  10. [10]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InThe twelfth international conference on learning representations, 2024. 10

  11. [11]

    Reinforced Self-Training (ReST) for Language Modeling

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling.arXiv preprint arXiv:2308.08998, 2023

  12. [12]

    DeepEyesV2: Toward Agentic Multimodal Model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  13. [13]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  14. [14]

    Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025

    Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C Hollon, and Bryan Wang. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025

  15. [15]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

  16. [16]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  17. [17]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

  18. [18]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023

  19. [19]

    Vlm-fo1: Bridging the gap between high-level reasoning and fine-grained perception in vlms

    Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, and Tiancheng Zhao. Vlm-fo1: Bridging the gap between high-level reasoning and fine-grained perception in vlms. arXiv preprint arXiv:2509.25916, 2025

  20. [20]

    HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

    Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, and Bo Zheng. Hide: Rethinking the zoom-in method in high resolution mllms via hierarchical decoupling.arXiv preprint arXiv:2510.00054, 2025

  21. [21]

    Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

    Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

  22. [22]

    On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025

    Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy- distillation

  23. [23]

    Textcot: Zoom-in for enhanced multimodal text-rich image understanding.ACM Transactions on Multimedia Computing, Communications and Applications, 22(4):1–19, 2026

    Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wengang Zhou, and Houqiang Li. Textcot: Zoom-in for enhanced multimodal text-rich image understanding.ACM Transactions on Multimedia Computing, Communications and Applications, 22(4):1–19, 2026

  24. [24]

    Beyond unimodal shortcuts: Mllms as cross-modal reasoners for grounded named entity recognition.arXiv preprint arXiv:2602.04486, 2026

    Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, and Min Zhang. Beyond unimodal shortcuts: Mllms as cross-modal reasoners for grounded named entity recognition.arXiv preprint arXiv:2602.04486, 2026

  25. [25]

    Gpt-5.1.https://openai.com/index/gpt-5-1/, 2025

    OpenAI. Gpt-5.1.https://openai.com/index/gpt-5-1/, 2025

  26. [26]

    Introducing gpt-5.2

    OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-2/ , 2025

  27. [27]

    Introducing gpt-5.4

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026. 11

  28. [28]

    Patch matters: Training-free fine-grained image caption enhancement via local perception

    Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, and Di Hu. Patch matters: Training-free fine-grained image caption enhancement via local perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3963–3973, 2025

  29. [29]

    In-context editing: Learning knowledge from self-induced distributions

    Siyuan Qi, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, and Zilong Zheng. In-context editing: Learning knowledge from self-induced distributions. arXiv preprint arXiv:2406.11194, 2024

  30. [30]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  31. [31]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth interna- tional conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  33. [33]

    Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration

    Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6613–6629, 2025

  34. [34]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  35. [35]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

  36. [36]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

  37. [37]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

  38. [38]

    Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

    Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

  39. [39]

    Grasp any region: Towards precise, contextual pixel understanding for multimodal llms.arXiv preprint arXiv:2510.18876, 2025

    Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, et al. Grasp any region: Towards precise, contextual pixel understanding for multimodal llms.arXiv preprint arXiv:2510.18876, 2025

  40. [40]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

  41. [41]

    VGR: Visual Grounded Reasoning

    Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

  42. [42]

    Hopchain: Multi-hop data synthesis for generalizable vision-language reasoning.arXiv preprint arXiv:2603.17024, 2026

    Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, et al. Hopchain: Multi-hop data synthesis for generalizable vision-language reasoning.arXiv preprint arXiv:2603.17024, 2026. 12

  43. [43]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907–7915, 2025

  44. [44]

    Vg-refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning.arXiv preprint arXiv:2512.06373, 2025

    Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, and Yansong Tang. Vg-refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning.arXiv preprint arXiv:2512.06373, 2025

  45. [45]

    Advancing multimodal reasoning via reinforcement learning with cold start

    Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, and Weiran Huang. Advancing multimodal reasoning via reinforcement learning with cold start. arXiv preprint arXiv:2505.22334, 2025

  46. [46]

    Zooming without zooming: Region-to-image distillation for fine-grained multimodal perception.arXiv preprint arXiv:2602.11858, 2026

    Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, et al. Zooming without zooming: Region-to-image distillation for fine-grained multimodal perception.arXiv preprint arXiv:2602.11858, 2026

  47. [47]

    Perception in reflection.arXiv preprint arXiv:2504.07165, 2025

    Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, et al. Perception in reflection.arXiv preprint arXiv:2504.07165, 2025

  48. [48]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025

  49. [49]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  50. [50]

    Mimo-vl technical report, 2025

    LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. URL https://arxiv.org/abs/ 2506.03569

  51. [51]

    Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling.arXiv preprint arXiv:2410.11325, 2024

    Wenda Xu, Rujun Han, Zifeng Wang, Long T Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling.arXiv preprint arXiv:2410.11325, 2024

  52. [52]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  53. [53]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  54. [54]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe.arXiv preprint arXiv:2509.18154, 2025

  55. [55]

    Zoom-refine: Boosting high-resolution multimodal understanding via localized zoom and self-refinement

    Xuan Yu, Dayan Guan, and Yanfeng Gu. Zoom-refine: Boosting high-resolution multimodal understanding via localized zoom and self-refinement. arXiv preprint arXiv:2506.01663, 2025

  56. [56]

    Visual reasoning tracer: Object-level grounded reasoning benchmark

    Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, and Ming-Hsuan Yang. Visual reasoning tracer: Object-level grounded reasoning benchmark. arXiv preprint arXiv:2512.05091, 2025

  57. [57]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  58. [58]

    MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs

    Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=DgaY5mDdmT

  59. [59]

    Finers: Fine-grained reasoning and segmentation of small objects with reinforcement learning

    Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, and You He. Finers: Fine-grained reasoning and segmentation of small objects with reinforcement learning. arXiv preprint arXiv:2510.21311, 2025

  60. [60]

    Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv e-prints, pages arXiv–2505, 2025

  61. [61]

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

  62. [62]

    Thyme: Think Beyond Images

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025

  63. [63]

    Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch

    Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, et al. Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch. arXiv preprint arXiv:2512.02395, 2025

  64. [64]

    Evaluating and steering modality preferences in multimodal large language model

    Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, and Min Zhang. Evaluating and steering modality preferences in multimodal large language model. arXiv preprint arXiv:2505.20977, 2025

  65. [65]

    Instruction Anchor: Dissecting the Mechanistic Dynamics of Modality Arbitration

    Yu Zhang, Mufan Xu, Xuefeng Bai, Pengfei Zhang, Yang Xiang, Min Zhang, et al. Instruction anchors: Dissecting the causal dynamics of modality arbitration. arXiv preprint arXiv:2602.03677, 2026

  66. [66]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  67. [67]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  68. [68]

    Thinking-with-Images

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

A Inference speed comparison

[Bar chart: inference speed (samples/s) for Vision-OPD-9B, DeepEyes, Thyme, DeepEyesV2, and SenseNova-MARS]

Figure...