Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

arXiv: 2605.18740 · v2 · pith:AM5CTMEE · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

Pith reviewed 2026-05-20 10:53 UTC · model grok-4.3

keywords: Vision-OPD · self-distillation · multimodal LLMs · fine-grained visual understanding · on-policy learning · regional-to-global perception gap · image crops

The pith

Vision-OPD lets MLLMs internalize fine-grained visual focus by self-distilling from their own evidence-centered crops to full images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a regional-to-global perception gap in multimodal large language models, where the same model answers fine-grained questions more accurately from focused image crops than from full images. To address this, it introduces Vision-OPD, a self-distillation method that uses the crop-conditioned version of the model as a teacher to guide the full-image version through on-policy rollouts, reducing differences in their token predictions. This approach allows the model to learn better attention to relevant details without any external models, labels, or additional tools during inference. Experiments demonstrate that models trained this way perform competitively against much larger systems on fine-grained visual benchmarks.
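The regional-to-global gap described above is a paired comparison: the same model answers each question once from the full image and once from the evidence-centered crop. A minimal sketch of that bookkeeping follows; the `PairedResult` type, the function name, and the toy data are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """One question answered twice by the same model (hypothetical record)."""
    correct_full: bool  # answer was correct given the full image
    correct_crop: bool  # answer was correct given the evidence-centered crop


def regional_to_global_gap(results):
    """Return (full_acc, crop_acc, gap), all in percentage points."""
    if not results:
        raise ValueError("need at least one paired result")
    n = len(results)
    full_acc = 100.0 * sum(r.correct_full for r in results) / n
    crop_acc = 100.0 * sum(r.correct_crop for r in results) / n
    return full_acc, crop_acc, crop_acc - full_acc


# Toy data, made up for illustration only.
toy = [
    PairedResult(correct_full=False, correct_crop=True),
    PairedResult(correct_full=True, correct_crop=True),
    PairedResult(correct_full=False, correct_crop=False),
    PairedResult(correct_full=False, correct_crop=True),
]
full_acc, crop_acc, gap = regional_to_global_gap(toy)
print(f"full={full_acc:.1f} crop={crop_acc:.1f} gap={gap:+.1f}")
# prints: full=25.0 crop=75.0 gap=+50.0
```

A positive gap means the crop-conditioned input is answered more accurately, which is the signal the method tries to transfer to full-image inference.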

Core claim

Vision-OPD transfers the privileged perception of a crop-conditioned teacher policy to a full-image student policy by minimizing the token-level divergence between their next-token distributions along the student's on-policy rollouts, so that the MLLM internalizes the benefits of visual zooming.
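Written out, the core claim could take roughly the following form. This is a sketch under stated assumptions: the question symbol $q$, the length normalization $1/|y|$, and the generic divergence $D$ are illustrative choices; the source only specifies a token-level divergence $D(p_T \,\|\, p_S)$ evaluated along student rollouts, and does not say here whether gradients flow through the teacher branch.

```latex
\mathcal{L}(\theta)
= \mathbb{E}_{y \sim p_S(\cdot \mid x_{\text{global}},\, q)}
\left[
  \frac{1}{|y|} \sum_{t=1}^{|y|}
  D\!\left(
    p_T(\cdot \mid x_{\text{crop}},\, q,\, y_{<t})
    \,\middle\|\,
    p_S(\cdot \mid x_{\text{global}},\, q,\, y_{<t})
  \right)
\right]
```

Both $p_T$ and $p_S$ are the same MLLM under different visual conditioning, so the only difference between the two terms inside $D$ is which image the model sees.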

What carries the argument

On-policy self-distillation from a crop-conditioned teacher to a full-image student within the same MLLM, minimizing divergence on generated rollouts to close the regional-to-global perception gap.
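To make the mechanism concrete, the sketch below averages a per-token divergence between teacher and student next-token distributions along one student-sampled rollout. It assumes forward KL as the divergence and uses made-up logits over a three-token vocabulary; the paper's actual divergence, weighting, and gradient handling may differ, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary; eps guards against log(0)."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q)
    )


def rollout_divergence(teacher_logits, student_logits):
    """Mean per-token KL(p_T || p_S) along one student-sampled rollout.

    Both arguments hold one logit vector per generated token. They are
    assumed to be scored on the same student-sampled prefix: the teacher
    conditions on the crop, the student on the full image.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("expected two non-empty, equally long logit sequences")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Toy logits for a 2-token rollout (numbers made up for illustration).
teacher = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
student = [[1.0, 1.0, 0.1], [0.2, 1.5, 1.0]]
mean_kl, per_token = rollout_divergence(teacher, student)
print(f"mean per-token KL: {mean_kl:.4f}")
```

Because the rollout is sampled from the student, the supervision lands on the states the student actually visits, which is the usual motivation for on-policy distillation.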

If this is right

The trained model performs better on fine-grained visual tasks using only full images.
It eliminates the need for external zooming or cropping tools at inference time.
Performance reaches levels competitive with larger or agentic models.
The method works without ground-truth labels or reward models.
Regional perception advantages can be internalized into global processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could lead to more efficient vision-language models that do not require high-resolution processing for all tasks.
Similar self-distillation might apply to other sensory modalities or perception challenges in AI.
Exploring variations in how crops are selected could further optimize the transfer process.

Load-bearing premise

The performance advantage on evidence-centered crops over full images stems from a focus problem that can be transferred via next-token distribution matching rather than from inherent differences in recognition capability.

What would settle it

If a model trained with Vision-OPD showed no gain, or a loss, in accuracy on fine-grained visual understanding benchmarks relative to its base model, the claimed effectiveness of the distillation approach would be falsified.

Figures

Figures reproduced from arXiv: 2605.18740 by Hongyu Lin, Jie Lou, Le Sun, Qianhao Yuan, Xianpei Han, Xing Yu, Yaojie Lu.

**Figure 2.** A case of the regional-to-global gap, based on Qwen3.5-9B. The global image input leads to the wrong answer, while the cropped region input yields the correct answer. [Chart residue: accuracy (%) under Global vs. Regional input for Qwen3.5-4B, Qwen3.5-9B, GLM-4.6V, GPT-5.4, and Gemini-3.1-Pro; gap annotations as extracted: +21.7, +19.5, +22.1, +19.3, +18.1.]

**Figure 4.** Overview of Vision-OPD. Left: Fine-grained visual questions are generated on evidence-centered crops and grounded back to the full image via bounding-box overlay. Right: A teacher policy $p_T(\cdot \mid x_{\text{crop}})$ and a student policy $p_S(\cdot \mid x_{\text{global}})$ are instantiated from the same MLLM. The student generates on-policy rollouts $y \sim p_S$, and the per-token divergence $D(p_T \,\|\, p_S)$ along these rollouts provides dense supervisi…

**Figure 5.** Regional-to-global gap during Vision-OPD training. A lower gap indicates that the model can better recover crop-visible evidence from the full image. Accompanying text: To test whether Vision-OPD addresses this bottleneck during training, we use the same comparison as in Section 3.1: each checkpoint answers the same question with the full image as input and with the evidence-centered crop as input. We track the r…

**Figure 6.** Inference speed comparison. Vision-OPD-9B achieves faster inference than agentic…
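The Figure 4 caption says questions written on a crop are grounded back to the full image via a bounding-box overlay. One plausible reading is a coordinate mapping from crop pixels to full-image pixels, sketched below. The function name, the `crop_origin` and `scale` parameters, and the toy numbers are assumptions for illustration; the paper's actual grounding procedure is not described in this summary.

```python
def crop_box_to_full(box_in_crop, crop_origin, scale=1.0):
    """Map a box (x0, y0, x1, y1) from crop pixels to full-image pixels.

    crop_origin: (left, top) corner of the crop inside the full image.
    scale: crop pixels per full-image pixel, for crops that were resized
           after being cut out (1.0 means no resizing).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    x0, y0, x1, y1 = box_in_crop
    left, top = crop_origin
    return (
        left + x0 / scale,
        top + y0 / scale,
        left + x1 / scale,
        top + y1 / scale,
    )


# Toy example: a crop cut at (400, 300) and upsampled 2x before viewing.
full_box = crop_box_to_full((20, 40, 120, 100), crop_origin=(400, 300), scale=2.0)
print(full_box)
# prints: (410.0, 320.0, 460.0, 350.0)
```

Passing the box that covers the whole crop returns the crop's own rectangle in full-image coordinates, which is the region an overlay would highlight.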

Original abstract

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs exhibit a regional-to-global perception gap, answering fine-grained questions more accurately on evidence-centered crops than full images. It proposes Vision-OPD, an on-policy self-distillation method that trains a full-image student policy to match the next-token distributions of a crop-conditioned teacher policy (instantiated from the same MLLM) along student-generated rollouts, thereby internalizing zooming benefits without external teachers, labels, verifiers, or inference-time tools. Experiments reportedly show competitive or superior results on fine-grained visual benchmarks versus larger models and agentic baselines.

Significance. If the regional-to-global gap holds and the distillation transfers it without implicit supervision in crop construction, the result would be significant: it offers a label-free, model-internal route to improve detail-oriented multimodal reasoning, potentially reducing reliance on scale or external agents while remaining compatible with existing MLLM training pipelines.

major comments (2)

§3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.
§4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.

minor comments (2)

Notation for the token-level divergence loss (Eq. 3 or equivalent) should explicitly state whether KL is computed only on student-generated tokens or includes teacher-forced tokens.
Figure 4 (method overview) would benefit from an explicit arrow or label showing the on-policy rollout path from student to teacher comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses

Referee: §3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.

Authors: We agree that full transparency on crop construction is essential. In the revised manuscript we will expand §3.1 with a complete algorithmic description and pseudocode of the evidence-centered crop procedure. The selection operates without access to ground-truth answers, without post-hoc verification against the answer, and without any mechanism that injects the fine-grained supervisory signal into the crop itself. This preserves the claim that the observed regional-to-global gap is emergent from the MLLM’s own perception rather than from privileged crop construction. We will also add an explicit statement confirming the absence of external labels or verifiers at crop-generation time.

Revision: yes

Referee: §4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.

Authors: We acknowledge that isolating the on-policy component and providing statistical context would strengthen attribution. In the revision we will add a new ablation in §4 that directly compares (i) the full Vision-OPD pipeline against (ii) a simple crop-augmentation baseline that feeds crops to the student without on-policy rollouts or distillation. We will also report mean performance and standard deviation over three independent training runs with different random seeds, together with error bars on the main benchmark tables. These results will appear in the main paper and supplementary material.

Revision: yes
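The promised reporting of mean and standard deviation over three seeds amounts to a small calculation, sketched here with Python's standard library. The scores are made up for illustration and are not results from the paper; the sketch assumes the sample standard deviation (n - 1 denominator), which the rebuttal does not specify.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical benchmark accuracies (%) from three random seeds.
toy_scores = [70.0, 72.0, 74.0]
run_mean, run_std = summarize_runs(toy_scores)
print(f"{run_mean:.1f} ± {run_std:.1f}")
# prints: 72.0 ± 2.0
```

With only three runs the spread estimate is itself noisy, so error bars of this kind indicate rough stability rather than formal significance.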

Circularity Check

0 steps flagged

No significant circularity detected in Vision-OPD derivation chain

Full rationale

The paper's derivation starts from an empirical observation of a regional-to-global perception gap (same MLLM performs better on evidence-centered crops than full images) and proceeds to a self-distillation procedure that instantiates crop-conditioned and full-image policies from the identical base MLLM, then minimizes token-level divergence along the student's on-policy rollouts. This chain does not reduce any claimed result to its inputs by construction: the crop advantage is presented as an independent, testable fact rather than a definitional premise, the distillation objective is a standard on-policy KL-style transfer that does not presuppose the final performance gain, and no self-citation or uniqueness theorem is invoked to force the method. The approach remains self-contained against external benchmarks because the training signal derives from differential conditioning on the same model rather than from fitted parameters renamed as predictions or from externally privileged labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical observation of a crop advantage and the standard assumption that aligning next-token distributions improves policy behavior; no new physical entities or ad-hoc constants are introduced.

axioms (1)

Domain assumption: The crop-conditioned version of the MLLM produces superior next-token distributions for fine-grained questions relative to the full-image version.
This observation is invoked to justify why distilling from the crop policy should improve the full-image policy.

pith-pipeline@v0.9.0 · 5789 in / 1214 out tokens · 48477 ms · 2026-05-20T10:53:32.140952+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We propose Vision-OPD, a regional-to-global self-distillation framework... without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 22 internal anchors

[1]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024
[2]

Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

work page 2024
[3]

Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025

work page arXiv 2025
[4]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

work page 2026
[7]

Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452, 2025

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452, 2025

work page arXiv 2025
[8]

Gemini 3

Google. Gemini 3. https://blog.google/products-and-platforms/products/ gemini/gemini-3/, 2025

work page 2025
[9]

Gemini 3.1 pro

Google. Gemini 3.1 pro. https://deepmind.google/models/model-cards/ gemini-3-1-pro/, 2026

work page 2026
[10]

Minillm: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. InThe twelfth international conference on learning representations, 2024. 10

work page 2024
[11]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling.arXiv preprint arXiv:2308.08998, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C Hollon, and Bryan Wang. Codev: Code with images for faithful visual reasoning via tool-aware policy optimization.arXiv preprint arXiv:2511.19661, 2025

work page arXiv 2025
[15]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

work page 2024
[16]

Reinforcement Learning via Self-Distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023

work page 2023
[19]

Vlm-fo1: Bridging the gap between high-level reasoning and fine-grained perception in vlms

Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, and Tiancheng Zhao. Vlm-fo1: Bridging the gap between high-level reasoning and fine-grained perception in vlms. arXiv preprint arXiv:2509.25916, 2025

work page arXiv 2025
[20]

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, and Bo Zheng. Hide: Rethinking the zoom-in method in high resolution mllms via hierarchical decoupling.arXiv preprint arXiv:2510.00054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models.arXiv preprint arXiv:2403.12966, 2024

work page arXiv 2024
[22]

On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy- distillation

work page doi:10.64434/tml.20251026 2025
[23]

Textcot: Zoom-in for enhanced multimodal text-rich image understanding.ACM Transactions on Multimedia Computing, Communications and Applications, 22(4):1–19, 2026

Bozhi Luan, Hao Feng, Hong Chen, Yonghui Wang, Wengang Zhou, and Houqiang Li. Textcot: Zoom-in for enhanced multimodal text-rich image understanding.ACM Transactions on Multimedia Computing, Communications and Applications, 22(4):1–19, 2026

work page 2026
[24]

Beyond unimodal shortcuts: Mllms as cross-modal reasoners for grounded named entity recognition.arXiv preprint arXiv:2602.04486, 2026

Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, and Min Zhang. Beyond unimodal shortcuts: Mllms as cross-modal reasoners for grounded named entity recognition.arXiv preprint arXiv:2602.04486, 2026

work page arXiv 2026
[25]

Gpt-5.1.https://openai.com/index/gpt-5-1/, 2025

OpenAI. Gpt-5.1.https://openai.com/index/gpt-5-1/, 2025

work page 2025
[26]

Introducing gpt-5.2

OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-2/ , 2025

work page 2025
[27]

Introducing gpt-5.4

OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/ , 2026. 11

work page 2026
[28]

Patch matters: Training-free fine-grained image caption enhancement via local perception

Ruotian Peng, Haiying He, Yake Wei, Yandong Wen, and Di Hu. Patch matters: Training-free fine-grained image caption enhancement via local perception. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3963–3973, 2025

work page 2025
[29]

In-context editing: Learning knowledge from self-induced distributions

Siyuan Qi, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, and Zilong Zheng. In-context editing: Learning knowledge from self-induced distributions. arXiv preprint arXiv:2406.11194, 2024

work page arXiv 2024
[30]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

work page 2026
[31]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth interna- tional conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

work page 2011
[32]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration

Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6613–6629, 2025

work page 2025
[34]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

work page 2024
[37]

Eyes wide shut? exploring the visual shortcomings of multimodal llms

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

work page 2024
[38]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

work page arXiv 2025
[39]

Grasp any region: Towards precise, contextual pixel understanding for multimodal llms.arXiv preprint arXiv:2510.18876, 2025

Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, et al. Grasp any region: Towards precise, contextual pixel understanding for multimodal llms.arXiv preprint arXiv:2510.18876, 2025

work page arXiv 2025
[40]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning.arXiv preprint arXiv:2506.11991, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Hopchain: Multi-hop data synthesis for generalizable vision-language reasoning.arXiv preprint arXiv:2603.17024, 2026

Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, et al. Hopchain: Multi-hop data synthesis for generalizable vision-language reasoning.arXiv preprint arXiv:2603.17024, 2026. 12

work page arXiv 2026
[43]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7907–7915, 2025

work page 2025
[44]

Vg-refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning.arXiv preprint arXiv:2512.06373, 2025

Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, and Yansong Tang. Vg-refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning.arXiv preprint arXiv:2512.06373, 2025

work page arXiv 2025
[45]

Advancing multimodal reasoning via reinforcement learning with cold start

Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, and Weiran Huang. Advancing multimodal reasoning via reinforcement learning with cold start. arXiv preprint arXiv:2505.22334, 2025

work page arXiv 2025
[46]

Zooming without zooming: Region-to-image distillation for fine-grained multimodal perception.arXiv preprint arXiv:2602.11858, 2026

Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, et al. Zooming without zooming: Region-to-image distillation for fine-grained multimodal perception.arXiv preprint arXiv:2602.11858, 2026

work page arXiv 2026
[47]

Perception in reflection.arXiv preprint arXiv:2504.07165, 2025

Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, et al. Perception in reflection.arXiv preprint arXiv:2504.07165, 2025

work page arXiv 2025
[48]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

work page 2024
[50]

LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. URL https://arxiv.org/abs/2506.03569
[51]

Wenda Xu, Rujun Han, Zifeng Wang, Long T Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. arXiv preprint arXiv:2410.11325, 2024
[52]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
[53]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
[54]

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. MiniCPM-V 4.5: Cooking efficient MLLMs via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025
[55]

Xuan Yu, Dayan Guan, and Yanfeng Gu. Zoom-refine: Boosting high-resolution multimodal understanding via localized zoom and self-refinement. arXiv preprint arXiv:2506.01663, 2025
[56]

Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, and Ming-Hsuan Yang. Visual reasoning tracer: Object-level grounded reasoning benchmark. arXiv preprint arXiv:2512.05091, 2025
[57]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022
[58]

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=DgaY5mDdmT
[59]

Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, and You He. Finers: Fine-grained reasoning and segmentation of small objects with reinforcement learning. arXiv preprint arXiv:2510.21311, 2025
[60]

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv e-prints, pages arXiv–2505, 2025
[61]

Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024
[62]

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025
[63]

Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, et al. Skywork-r1v4: Toward agentic multimodal intelligence through interleaved thinking with images and deepresearch. arXiv preprint arXiv:2512.02395, 2025
[64]

Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, and Min Zhang. Evaluating and steering modality preferences in multimodal large language model. arXiv preprint arXiv:2505.20977, 2025
[65]

Yu Zhang, Mufan Xu, Xuefeng Bai, Pengfei Zhang, Yang Xiang, Min Zhang, et al. Instruction anchors: Dissecting the causal dynamics of modality arbitration. arXiv preprint arXiv:2602.03677, 2026
[66]

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026
[67]

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025
[68]

Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

A Inference speed comparison

Figure... (plot of Inference Speed (Samples/s), axis range 0.0 to 3.0, comparing Vision-OPD-9B, DeepEyes, Thyme, DeepEyesV2, and SenseNova-MARS)
