Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Pith reviewed 2026-05-20 10:53 UTC · model grok-4.3
The pith
Vision-OPD lets MLLMs internalize fine-grained visual focus by self-distilling from their own evidence-centered crops to full images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-OPD transfers the privileged perception of a crop-conditioned teacher policy to a full-image student policy by minimizing the token-level divergence between their next-token distributions along the student's on-policy rollouts, so that the MLLM internalizes the benefits of visual zooming.
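One way to write this objective down, as a sketch rather than the paper's own equation: the symbols $\pi_\theta$ (the shared MLLM), $x_{\text{crop}}$ and $x_{\text{full}}$ (crop and full image), $q$ (question), $y$ (student rollout), and the generic divergence $D$ are notation assumed here. The summary does not state the direction of the divergence, the length normalization, or whether gradients flow through the teacher branch, so those choices are left open.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{\,y \sim \pi_\theta(\cdot \mid x_{\text{full}},\, q)}
    \left[ \frac{1}{|y|} \sum_{t=1}^{|y|}
      D\Big( \underbrace{\pi_\theta(\cdot \mid x_{\text{crop}},\, q,\, y_{<t})}_{\text{teacher}},\;
             \underbrace{\pi_\theta(\cdot \mid x_{\text{full}},\, q,\, y_{<t})}_{\text{student}} \Big)
    \right]
```

The expectation is over rollouts sampled from the student, which is what makes the procedure on-policy: both distributions are evaluated on prefixes $y_{<t}$ that the full-image policy itself produced.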
What carries the argument
On-policy self-distillation from a crop-conditioned teacher to a full-image student within the same MLLM, minimizing divergence on generated rollouts to close the regional-to-global perception gap.
If this is right
- The trained model performs better on fine-grained visual tasks using only full images.
- It eliminates the need for external zooming or cropping tools at inference time.
- Performance reaches levels competitive with larger or agentic models.
- The method works without ground-truth labels or reward models.
- Regional perception advantages can be internalized into global processing.
Where Pith is reading between the lines
- This could lead to more efficient vision-language models that do not require high-resolution processing for all tasks.
- Similar self-distillation might apply to other sensory modalities or perception challenges in AI.
- Exploring variations in how crops are selected could further optimize the transfer process.
Load-bearing premise
The performance advantage on evidence-centered crops over full images stems from a focus problem that can be transferred via next-token distribution matching rather than from inherent differences in recognition capability.
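The gap this premise rests on can be measured from per-question correctness alone. The sketch below is illustrative: the function names and the example flags are hypothetical, and the numbers are not results from the paper.

```python
def accuracy(correct):
    """Fraction of True entries in a list of per-question correctness flags."""
    return sum(correct) / len(correct)


def perception_gap(crop_correct, full_correct):
    """Crop accuracy minus full-image accuracy over the same questions."""
    if len(crop_correct) != len(full_correct):
        raise ValueError("both lists must cover the same questions")
    return accuracy(crop_correct) - accuracy(full_correct)


def recoverable_failures(crop_correct, full_correct):
    """Count questions missed on the full image but answered correctly on the crop."""
    return sum(c and not f for c, f in zip(crop_correct, full_correct))


# Hypothetical flags for five questions (illustrative only).
crop_flags = [True, True, False, True, True]
full_flags = [True, False, False, True, False]

gap = perception_gap(crop_flags, full_flags)
recoverable = recoverable_failures(crop_flags, full_flags)
```

A positive gap shows only that the crop helps. Whether the cause is a focus problem or a difference in effective resolution is exactly what the premise assumes, and the count alone does not decide it.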
What would settle it
If a model trained with Vision-OPD shows no gain, or a loss, in accuracy on fine-grained visual understanding benchmarks relative to the original model, the distillation approach would be falsified.
Original abstract
Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models.
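The mechanism in the abstract can be illustrated with a toy stand-in for the MLLM. Everything here is an assumption for illustration: `toy_policy`, the four-token vocabulary, the context vectors, and the choice of KL(student || teacher) as the divergence. The sketch only computes the loss along one student rollout and performs no gradient update.

```python
import math
import random

VOCAB = 4  # toy vocabulary size (hypothetical)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_policy(weights, context, prefix):
    """Next-token distribution of a toy 'model'.

    `weights` stands in for the shared MLLM parameters and `context` for the
    visual input (crop or full image). Both are hypothetical placeholders.
    """
    last = prefix[-1] if prefix else None
    logits = [
        weights[v] * context[v] + (0.5 if v == last else 0.0)
        for v in range(VOCAB)
    ]
    return softmax(logits)


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def rollout_loss(weights, crop_ctx, full_ctx, length, rng):
    """Sample a rollout from the full-image student and average per-token KL."""
    prefix, per_token = [], []
    for _ in range(length):
        student = toy_policy(weights, full_ctx, prefix)
        teacher = toy_policy(weights, crop_ctx, prefix)  # same weights, different input
        per_token.append(kl(student, teacher))  # assumed direction
        token = rng.choices(range(VOCAB), weights=student)[0]  # student samples
        prefix.append(token)
    return sum(per_token) / length, prefix


rng = random.Random(0)
weights = [1.0, 1.0, 1.0, 1.0]
crop_ctx = [2.0, 0.1, 0.1, 0.1]  # evidence is prominent in the crop
full_ctx = [0.5, 0.4, 0.4, 0.4]  # evidence is diluted in the full image

loss, tokens = rollout_loss(weights, crop_ctx, full_ctx, length=6, rng=rng)
same, _ = rollout_loss(weights, crop_ctx, crop_ctx, length=6, rng=rng)
```

When both branches see the same input, the divergence is zero, so the training signal comes entirely from the difference in conditioning rather than from labels or a reward.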
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that MLLMs exhibit a regional-to-global perception gap, answering fine-grained questions more accurately on evidence-centered crops than full images. It proposes Vision-OPD, an on-policy self-distillation method that trains a full-image student policy to match the next-token distributions of a crop-conditioned teacher policy (instantiated from the same MLLM) along student-generated rollouts, thereby internalizing zooming benefits without external teachers, labels, verifiers, or inference-time tools. Experiments reportedly show competitive or superior results on fine-grained visual benchmarks versus larger models and agentic baselines.
Significance. If the regional-to-global gap holds and the distillation transfers it without implicit supervision in crop construction, the result would be significant: it offers a label-free, model-internal route to improve detail-oriented multimodal reasoning, potentially reducing reliance on scale or external agents while remaining compatible with existing MLLM training pipelines.
major comments (2)
- §3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.
- §4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.
minor comments (2)
- Notation for the token-level divergence loss (Eq. 3 or equivalent) should explicitly state whether KL is computed only on student-generated tokens or includes teacher-forced tokens.
- Figure 2 (method overview) would benefit from an explicit arrow or label showing the on-policy rollout path from student to teacher comparison.
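The distinction raised in the first minor comment comes down to which positions enter the average. A minimal sketch, assuming a per-position divergence has already been computed; the values, the masks, and the name `masked_mean` are hypothetical:

```python
def masked_mean(per_token_loss, mask):
    """Average the loss over positions where the mask is nonzero."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return sum(kept) / len(kept)


# Hypothetical divergences: 3 teacher-forced positions, then 4 student-generated ones.
per_token = [0.9, 0.7, 0.8, 0.2, 0.1, 0.3, 0.2]
generated_only = [0, 0, 0, 1, 1, 1, 1]
all_positions = [1] * len(per_token)

loss_generated = masked_mean(per_token, generated_only)
loss_all = masked_mean(per_token, all_positions)
```

The two averages differ whenever the excluded positions carry a different typical divergence, which is why the notation needs to state the choice.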
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
- Referee: §3.1 (Crop-conditioned teacher construction): the procedure for selecting 'evidence-centered crops' must be specified in detail; if crop selection uses answer-derived heuristics, model-based region proposals, or any post-hoc verification that encodes the fine-grained signal, the claimed absence of ground-truth labels or external privilege is undermined and the regional-to-global gap is no longer purely emergent.
Authors: We agree that full transparency on crop construction is essential. In the revised manuscript we will expand §3.1 with a complete algorithmic description and pseudocode of the evidence-centered crop procedure. The selection operates without access to ground-truth answers, without post-hoc verification against the answer, and without any mechanism that injects the fine-grained supervisory signal into the crop itself. This preserves the claim that the observed regional-to-global gap is emergent from the MLLM’s own perception rather than from privileged crop construction. We will also add an explicit statement confirming the absence of external labels or verifiers at crop-generation time.
Revision: yes
- Referee: §4 (Experiments and ablations): the central performance claims rest on observed gaps between Vision-OPD and baselines, yet no ablations isolate the contribution of on-policy rollouts versus simple crop augmentation, and no statistical significance or variance estimates across runs are reported; without these controls the attribution of gains to the distillation procedure remains unverified.
Authors: We acknowledge that isolating the on-policy component and providing statistical context would strengthen attribution. In the revision we will add a new ablation in §4 that directly compares (i) the full Vision-OPD pipeline against (ii) a simple crop-augmentation baseline that feeds crops to the student without on-policy rollouts or distillation. We will also report mean performance and standard deviation over three independent training runs with different random seeds, together with error bars on the main benchmark tables. These results will appear in the main paper and supplementary material.
Revision: yes
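The promised reporting over three seeds could be produced along these lines. This is a sketch: the scores are hypothetical placeholders, not numbers from the paper, and the use of the sample standard deviation (n - 1 denominator) is an assumption about what would be reported.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of run scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std, divides by n - 1
    return mean, std


def format_cell(scores, digits=1):
    """Format a table cell as 'mean ± std'."""
    mean, std = summarize(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Hypothetical benchmark accuracies (%) from three seeds.
seed_scores = [71.2, 70.4, 72.0]
cell = format_cell(seed_scores)
```

With only three runs the standard deviation is itself a rough estimate, so a difference between methods smaller than about one standard deviation should be read with caution.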
Circularity Check
No significant circularity detected in Vision-OPD derivation chain
The paper's derivation starts from an empirical observation of a regional-to-global perception gap (same MLLM performs better on evidence-centered crops than full images) and proceeds to a self-distillation procedure that instantiates crop-conditioned and full-image policies from the identical base MLLM, then minimizes token-level divergence along the student's on-policy rollouts. This chain does not reduce any claimed result to its inputs by construction: the crop advantage is presented as an independent, testable fact rather than a definitional premise, the distillation objective is a standard on-policy KL-style transfer that does not presuppose the final performance gain, and no self-citation or uniqueness theorem is invoked to force the method. The approach remains self-contained against external benchmarks because the training signal derives from differential conditioning on the same model rather than from fitted parameters renamed as predictions or from externally privileged labels.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The crop-conditioned version of the MLLM produces superior next-token distributions for fine-grained questions relative to the full-image version.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We propose Vision-OPD, a regional-to-global self-distillation framework... without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.