Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Hongyu Lin; Jie Lou; Le Sun; Qianhao Yuan; Xianpei Han; Xing Yu; Yaojie Lu

arxiv: 2605.18740 · v4 · pith:AM5CTMEEnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.CL · cs.LG


Pith reviewed 2026-05-20 10:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG

keywords Vision-OPD · self-distillation · multimodal LLMs · fine-grained visual understanding · on-policy learning · regional-to-global perception gap · image crops

The pith

Vision-OPD lets MLLMs internalize fine-grained visual focus by self-distilling from their own evidence-centered crops to full images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a regional-to-global perception gap in multimodal large language models, where the same model answers fine-grained questions more accurately from focused image crops than from full images. To address this, it introduces Vision-OPD, a self-distillation method that uses the crop-conditioned version of the model as a teacher to guide the full-image version through on-policy rollouts, reducing differences in their token predictions. This approach allows the model to learn better attention to relevant details without any external models, labels, or additional tools during inference. Experiments demonstrate that models trained this way perform competitively against much larger systems on fine-grained visual benchmarks.

Core claim

Vision-OPD transfers the privileged perception of a crop-conditioned teacher policy to a full-image student policy by minimizing token-level divergence between their next-token distributions along the student's on-policy rollouts, so that the MLLM internalizes the benefits of visual zooming.
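
One way to write the core claim as an objective is sketched below. The symbols p_T, p_S, x_crop, x_global, y, and D follow the Figure 4 caption, including the direction D(p_T ∥ p_S). The expectation, the sum over rollout positions t, the conditioning on the prefix y_{<t}, and the omission of the question from the conditioning are assumptions made for readability; the paper's exact loss may differ in normalization or in the choice of D.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{y \sim p_S(\cdot \mid x_{\text{global}})}
    \left[ \sum_{t=1}^{|y|}
      D\!\left( p_T(\cdot \mid x_{\text{crop}}, y_{<t})
        \,\middle\|\, p_S(\cdot \mid x_{\text{global}}, y_{<t}) \right)
    \right]
```

Here θ denotes the parameters of the shared MLLM. Whether gradients also flow through the teacher branch is not stated in the text above, so the sketch leaves that open.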

What carries the argument

On-policy self-distillation from a crop-conditioned teacher to a full-image student within the same MLLM, minimizing divergence on generated rollouts to close the regional-to-global perception gap.
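
To make "token-level divergence along a rollout" concrete, here is a minimal pure-Python sketch. It assumes D is the KL divergence KL(p_T ∥ p_S) and that both policies expose one logit vector per rollout position; the function names and the toy logits are hypothetical and are not taken from the paper or its code.

```python
import math


def softmax(logits):
    """Turn a logit vector into a categorical distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def rollout_divergence(teacher_logits, student_logits):
    """Per-token KL(p_T || p_S) and its mean over one student rollout.

    teacher_logits[t] stands for the crop-conditioned logits at position t,
    student_logits[t] for the full-image logits at the same position, both
    scored on the same student-generated prefix (an assumption of this sketch).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Toy logits over a 3-token vocabulary for a 2-token rollout (illustrative only).
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student = [[0.5, 0.4, 0.3], [0.0, 3.0, 0.0]]
mean_kl, per_token = rollout_divergence(teacher, student)
print(mean_kl, per_token)
```

In this toy case the second position contributes no loss because the two distributions agree there, so all of the training signal comes from the first position, which is what "dense, per-token supervision" means in practice.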

If this is right

The trained model performs better on fine-grained visual tasks using only full images.
It eliminates the need for external zooming or cropping tools at inference time.
Performance reaches levels competitive with larger or agentic models.
The method works without ground-truth labels or reward models.
Regional perception advantages can be internalized into global processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could lead to more efficient vision-language models that do not require high-resolution processing for all tasks.
Similar self-distillation might apply to other sensory modalities or perception challenges in AI.
Exploring variations in how crops are selected could further optimize the transfer process.

Load-bearing premise

The performance advantage on evidence-centered crops over full images stems from a focus problem that can be transferred via next-token distribution matching rather than from inherent differences in recognition capability.

What would settle it

Running the Vision-OPD training on a model and observing no gain or a loss in accuracy on fine-grained visual understanding benchmarks compared to the original model would falsify the effectiveness of the distillation approach.

Figures

Figures reproduced from arXiv: 2605.18740 by Hongyu Lin, Jie Lou, Le Sun, Qianhao Yuan, Xianpei Han, Xing Yu, Yaojie Lu.

**Figure 2.** A case of the regional-to-global gap, based on Qwen3.5-9B. The global image input leads to the wrong answer, while the cropped region input yields the correct answer. [Accompanying plot: accuracy (%) with global vs. regional input for Qwen3.5-4B, Qwen3.5-9B, GLM-4.6V, GPT-5.4, and Gemini-3.1-Pro; gaps shown: +21.7, +19.5, +22.1, +19.3, +18.1.]

**Figure 4.** Overview of Vision-OPD. Left: Fine-grained visual questions are generated on evidence-centered crops and grounded back to the full image via bounding-box overlay. Right: A teacher policy $p_T(\cdot \mid x_{\text{crop}})$ and a student policy $p_S(\cdot \mid x_{\text{global}})$ are instantiated from the same MLLM. The student generates on-policy rollouts $y \sim p_S$, and the per-token divergence $D(p_T \,\|\, p_S)$ along these rollouts provides dense supervisi…

**Figure 5.** Regional-to-global gap during Vision-OPD training. A lower gap indicates that the model can better recover crop-visible evidence from the full image. To test whether Vision-OPD addresses this bottleneck during training, we use the same comparison as in Section 3.1: each checkpoint answers the same question with the full image as input and with the evidence-centered crop as input. We track the r…

**Figure 6.** Inference speed comparison. Vision-OPD-9B achieves faster inference than agentic…
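
The regional-to-global gap that Figures 2 and 5 refer to can be read as the accuracy with crop input minus the accuracy with full-image input on the same questions. The sketch below assumes exact-match scoring and uses made-up predictions; the function names are hypothetical and the numbers are not from the paper.

```python
def accuracy(predictions, answers):
    """Exact-match accuracy in percent."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def regional_to_global_gap(global_preds, regional_preds, answers):
    """Accuracy with crop input minus accuracy with full-image input."""
    return accuracy(regional_preds, answers) - accuracy(global_preds, answers)


# Illustrative data only: four questions answered under both input conditions.
answers = ["A", "B", "C", "D"]
global_preds = ["A", "C", "C", "A"]    # 2 of 4 correct with the full image
regional_preds = ["A", "B", "C", "A"]  # 3 of 4 correct with the crop
gap = regional_to_global_gap(global_preds, regional_preds, answers)
print(gap)  # 25.0
```

Evaluating each training checkpoint this way gives one gap value per checkpoint, which is the kind of curve the Figure 5 caption describes.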

Original abstract

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD
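
The training loop described in the abstract can be illustrated with a deliberately tiny toy, sketched here under strong assumptions. Each "policy" is a table of logits indexed by the previous token, the teacher is a fixed separate table (in the paper both policies come from the same MLLM under different image inputs), the divergence is KL(p_T ∥ p_S), and the update is plain gradient descent. All names and constants are hypothetical.

```python
import math
import random

VOCAB = 3            # toy vocabulary size (assumption)
ROLLOUT_LEN = 8      # tokens per rollout (assumption)
LEARNING_RATE = 0.5  # step size for the toy update (assumption)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for strictly positive categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def total_kl(teacher, student):
    """Sum of KL(p_T || p_S) over every context (previous token)."""
    return sum(kl(softmax(t), softmax(s)) for t, s in zip(teacher, student))


def sample_rollout(student, rng):
    """Sample contexts by following the student's own next-token choices."""
    contexts, prev = [], 0  # token 0 is a fixed start token
    for _ in range(ROLLOUT_LEN):
        contexts.append(prev)
        prev = rng.choices(range(VOCAB), weights=softmax(student[prev]))[0]
    return contexts


def train_step(teacher, student, rng):
    """Pull the student toward the teacher at contexts the student visited."""
    for ctx in sample_rollout(student, rng):
        p_s, p_t = softmax(student[ctx]), softmax(teacher[ctx])
        # d KL(p_T || p_S) / d student_logits = p_S - p_T
        for k in range(VOCAB):
            student[ctx][k] -= LEARNING_RATE * (p_s[k] - p_t[k])


rng = random.Random(0)
teacher = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
student = [[0.0] * VOCAB for _ in range(VOCAB)]
before = total_kl(teacher, student)
for _ in range(50):
    train_step(teacher, student, rng)
after = total_kl(teacher, student)
print(before, after)
```

The point of the toy is the sampling pattern: the student is only corrected at contexts it reaches by itself, which is what distinguishes on-policy distillation from training on teacher-generated or fixed sequences.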

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs exhibit a regional-to-global perception gap, answering fine-grained questions more accurately on evidence-centered crops than full images. It proposes Vision-OPD, an on-policy self-distillation method that trains a full-image student policy to match the next-token distributions of a crop-conditioned teacher policy (instantiated from the same MLLM) along student-generated rollouts, thereby internalizing zooming benefits without external teachers, labels, verifiers, or inference-time tools. Experiments reportedly show competitive or superior results on fine-grained visual benchmarks versus larger models and agentic baselines.

Significance. If the regional-to-global gap holds and the distillation transfers it without implicit supervision in crop construction, the result would be significant: it offers a label-free, model-internal route to improve detail-oriented multimodal reasoning, potentially reducing reliance on scale or external agents while remaining compatible with existing MLLM training pipelines.

major comments (2)

§3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.
§4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.

minor comments (2)

Notation for the token-level divergence loss (Eq. 3 or equivalent) should explicitly state whether KL is computed only on student-generated tokens or includes teacher-forced tokens.
Figure 4 (method overview) would benefit from an explicit arrow or label showing the on-policy rollout path from student to teacher comparison.
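
The distinction raised in the first minor comment can be made concrete with a loss mask. The sketch below assumes a per-token loss is already available and that a boolean mask marks which positions were generated by the student; the helper name and the numbers are hypothetical.

```python
def masked_mean(per_token_loss, keep_mask):
    """Average the loss over positions where keep_mask is True."""
    kept = [loss for loss, keep in zip(per_token_loss, keep_mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


# Illustrative values: two prompt (or teacher-forced) positions followed by
# two student-generated positions.
per_token_loss = [0.9, 0.7, 0.2, 0.4]
is_student_generated = [False, False, True, True]

student_only = masked_mean(per_token_loss, is_student_generated)
all_positions = masked_mean(per_token_loss, [True] * len(per_token_loss))
print(student_only, all_positions)
```

The two averages differ noticeably even in this small case, which is why the notation should say which positions enter the loss.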

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: §3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.

Authors: We agree that full transparency on crop construction is essential. In the revised manuscript we will expand §3.1 with a complete algorithmic description and pseudocode of the evidence-centered crop procedure. The selection operates without access to ground-truth answers, without post-hoc verification against the answer, and without any mechanism that injects the fine-grained supervisory signal into the crop itself. This preserves the claim that the observed regional-to-global gap is emergent from the MLLM’s own perception rather than from privileged crop construction. We will also add an explicit statement confirming the absence of external labels or verifiers at crop-generation time.

Revision: yes

Referee: §4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.

Authors: We acknowledge that isolating the on-policy component and providing statistical context would strengthen attribution. In the revision we will add a new ablation in §4 that directly compares (i) the full Vision-OPD pipeline against (ii) a simple crop-augmentation baseline that feeds crops to the student without on-policy rollouts or distillation. We will also report mean performance and standard deviation over three independent training runs with different random seeds, together with error bars on the main benchmark tables. These results will appear in the main paper and supplementary material.

Revision: yes
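
The reporting promised in the second response (mean and standard deviation over three seeds for each ablation arm) could be computed as in the sketch below. The arm names are hypothetical and the scores are placeholders chosen for easy arithmetic; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


# Placeholder accuracies for three seeds per arm (NOT results from the paper).
runs = {
    "full_pipeline": [50.0, 51.0, 52.0],
    "crop_augmentation_baseline": [50.0, 50.0, 50.0],
}

summary = {name: summarize(scores) for name, scores in runs.items()}
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.1f} ± {sd:.1f}")
```

With only three seeds the sample standard deviation is a rough estimate, so it is best read as an error bar rather than as evidence of statistical significance.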

Circularity Check

0 steps flagged

No significant circularity detected in Vision-OPD derivation chain

The paper's derivation starts from an empirical observation of a regional-to-global perception gap (same MLLM performs better on evidence-centered crops than full images) and proceeds to a self-distillation procedure that instantiates crop-conditioned and full-image policies from the identical base MLLM, then minimizes token-level divergence along the student's on-policy rollouts. This chain does not reduce any claimed result to its inputs by construction: the crop advantage is presented as an independent, testable fact rather than a definitional premise, the distillation objective is a standard on-policy KL-style transfer that does not presuppose the final performance gain, and no self-citation or uniqueness theorem is invoked to force the method. The approach remains self-contained against external benchmarks because the training signal derives from differential conditioning on the same model rather than from fitted parameters renamed as predictions or from externally privileged labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical observation of a crop advantage and the standard assumption that aligning next-token distributions improves policy behavior; no new physical entities or ad-hoc constants are introduced.

axioms (1)

Domain assumption: The crop-conditioned version of the MLLM produces superior next-token distributions for fine-grained questions relative to the full-image version.
Role: This observation is invoked to justify why distilling from the crop policy should improve the full-image policy.

pith-pipeline@v0.9.0 · 5789 in / 1214 out tokens · 48477 ms · 2026-05-20T10:53:32.140952+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Paper passage: Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Tag: unclear (relation between the paper passage and the cited Recognition theorem).
Paper passage: We propose Vision-OPD, a regional-to-global self-distillation framework... without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.