Recognition: 2 Lean theorem links
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
Pith reviewed 2026-05-13 23:54 UTC · model grok-4.3
The pith
V-Reflection turns MLLMs into active visual interrogators by using latent states as dynamic probes during reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
V-Reflection transforms the MLLM into an active interrogator through a 'think-then-look' visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression Module (BCM) establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model's hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence.
What carries the argument
Dynamic Autoregressive Compression (DAC) module, which maps the model's hidden states into dynamic probes that interrogate the global visual feature map, trained by distillation from a Box-Guided Compression teacher that provides explicit spatial targets.
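To make the mechanism concrete, here is a minimal sketch of what a DAC-style probe could look like, assuming it amounts to a learned projection of decoder hidden states followed by cross-attention over the frozen global visual feature map. The class and tensor names (`DynamicProbe`, `hidden_states`, `visual_feats`) are illustrative and do not come from the paper.

```python
import torch
import torch.nn as nn

class DynamicProbe(nn.Module):
    """Hypothetical DAC-style module: projects decoder hidden states into
    query probes that cross-attend over the global visual feature map."""

    def __init__(self, d_model: int, d_visual: int, n_heads: int = 8):
        super().__init__()
        self.to_probe = nn.Linear(d_model, d_model)      # H -> Q_dyn
        self.visual_proj = nn.Linear(d_visual, d_model)  # align vision dim
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden_states: torch.Tensor, visual_feats: torch.Tensor):
        # hidden_states: (B, S, d_model) latent reasoning states
        # visual_feats:  (B, N_patches, d_visual) frozen global feature map
        q_dyn = self.to_probe(hidden_states)             # dynamic probes
        kv = self.visual_proj(visual_feats)
        # Each reasoning step "looks" at the image by attending over patches.
        grounded, attn = self.cross_attn(q_dyn, kv, kv, need_weights=True)
        return grounded, attn  # attn maps can be visualized for localization
```

The attention weights returned here are the kind of artifact the paper's localization visualizations would rely on, under this reading.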
If this is right
- Reasoning steps become grounded in actual visual evidence extracted on demand rather than relying solely on the initial image encoding.
- The model maintains standard autoregressive efficiency because both compression modules are inactive at inference time (see the sketch after this list).
- Visualizations show that latent states autonomously localize task-critical regions without external guidance.
- The approach narrows the fine-grained perception gap across six perception-intensive benchmarks.
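As flagged in the second bullet, the efficiency claim rests on the auxiliary branch being training-only. Below is a runnable toy sketch of that train/inference split, using a GRU cell as a stand-in for latent autoregressive decoding; every name here (`ToyLatentReasoner`, `use_probe`) is illustrative, and nothing in it comes from the paper.

```python
import torch
import torch.nn as nn

class ToyLatentReasoner(nn.Module):
    """Toy stand-in for the claimed train/inference split: the probe branch
    runs only during training, so inference cost equals the bare decoder."""

    def __init__(self, d: int = 32, steps: int = 8):
        super().__init__()
        self.steps = steps                    # S = 8 latent steps, per the paper
        self.step_cell = nn.GRUCell(d, d)     # stands in for latent AR decoding
        self.probe = nn.MultiheadAttention(d, 4, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, use_probe: bool):
        h = visual_feats.mean(dim=1)          # crude initial latent state
        latents = []
        for _ in range(self.steps):
            h = self.step_cell(h, h)          # autoregressive latent update
            latents.append(h)
        H = torch.stack(latents, dim=1)       # (B, S, d)
        if use_probe:
            # Training-only branch: extra outputs feed auxiliary losses, while
            # the main latent path H is identical to the inference path.
            grounded, attn = self.probe(H, visual_feats, visual_feats)
            return H, grounded, attn
        return H

model = ToyLatentReasoner()
feats = torch.randn(2, 16, 32)                # (B, N_patches, d)
_ = model(feats, use_probe=True)              # training path: probe active
_ = model(feats, use_probe=False)             # inference path: probe skipped
```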
Where Pith is reading between the lines
- The same probe-style training could be applied to other modalities by distilling localization behavior into their internal states.
- Models might accumulate fewer perception errors across long reasoning chains because each step can re-query the input features.
- The method suggests that perception hallucinations can be mitigated by making internal representations responsible for directing attention rather than fixing all visual information at the start.
Load-bearing premise
The two-stage distillation from the Box-Guided Compression Module teacher successfully transfers the ability to localize task-critical evidence into the Dynamic Autoregressive Compression student so the main model can perform the interrogation at inference without the auxiliary modules.
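A minimal sketch of one way this transfer could be trained, assuming the objective is a simple match between the student's probed latents and the teacher's box-guided targets, plus an optional attention-alignment term. The loss form, the weight `beta`, and all names are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_latents: torch.Tensor,
                 teacher_latents: torch.Tensor,
                 attn_student: torch.Tensor,
                 attn_teacher: torch.Tensor,
                 beta: float = 1.0) -> torch.Tensor:
    """Hypothetical BCM->DAC distillation: align the student's dynamically
    probed latents with the teacher's box-guided targets, and align where
    the student looks with where the teacher's boxes say to look."""
    latent_term = F.mse_loss(student_latents, teacher_latents.detach())
    # KL between attention maps treats the teacher's box-derived attention
    # as a soft localization target (rows sum to 1 over image patches).
    attn_term = F.kl_div(attn_student.clamp_min(1e-8).log(),
                         attn_teacher.detach(), reduction="batchmean")
    return latent_term + beta * attn_term
```

If something like this is what the paper does, the teacher is frozen (hence `detach()`) and only the student's probe pathway receives gradients.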
What would settle it
The claim would fail if performance on fine-grained benchmarks remained unchanged after removing the distillation step, or if the latent probes failed to align with human-annotated task-critical regions during reasoning.
Original abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step for task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression Module (BCM) establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model's hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, maintaining a purely end-to-end autoregressive decoding in the latent space with optimal efficiency. Extensive experiments demonstrate the effectiveness of our V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces V-Reflection, a framework that transforms MLLMs from passive observers into active interrogators via a 'think-then-look' visual reflection mechanism. Latent states are positioned as dynamic probes that interrogate the visual feature space to ground each reasoning step with task-critical evidence. The method uses a two-stage distillation process: a Box-Guided Compression Module (BCM) teacher establishes pixel-to-latent spatial targets, followed by a Dynamic Autoregressive Compression (DAC) student that maps hidden states to dynamic probes. Both auxiliary modules are stated to be inactive at inference, preserving end-to-end autoregressive decoding. The paper claims this approach yields improvements across six perception-intensive benchmarks and narrows the fine-grained perception gap, supported by visualizations of autonomous localization.
Significance. If the distillation successfully internalizes active visual probing such that performance gains persist without the auxiliary modules at inference, the work would represent a meaningful advance in reducing perception hallucinations in MLLMs. The zero-inference-overhead design would be a practical strength, and the two-stage teacher-student transfer of spatial grounding into latent states offers a concrete technical path for making visual reasoning more dynamic and evidence-grounded.
Major comments (1)
- [Abstract] The central claim that the two-stage distillation (BCM teacher to DAC student) internalizes localization, so that the base MLLM performs 'think-then-look' interrogation purely via its own latent states at inference, is load-bearing. No ablation results are referenced that isolate whether benchmark gains survive complete removal of BCM/DAC post-training, or whether latent states function as independent probes rather than benefiting from residual training signals.
Minor comments (2)
- [Abstract] The abstract asserts 'extensive experiments' and improvements on 'six perception-intensive benchmarks' but supplies no benchmark names, numerical deltas, error bars, or ablation tables, making the magnitude and robustness of the claimed gains impossible to evaluate from the summary alone.
- [Abstract] The description of the DAC module mapping 'hidden states into dynamic probes' and the BCM establishing 'stable pixel-to-latent targets' introduces new entities without defining their architectures, loss formulations, or how the distillation objective is constructed.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The concern about isolating the effect of internalized localization via latent states is well-taken, and we will strengthen the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that the two-stage distillation (BCM teacher to DAC student) internalizes localization, so that the base MLLM performs 'think-then-look' interrogation purely via its own latent states at inference, is load-bearing. No ablation results are referenced that isolate whether benchmark gains survive complete removal of BCM/DAC post-training, or whether latent states function as independent probes rather than benefiting from residual training signals.
  Authors: We agree that explicit isolation of the post-distillation gains is necessary to support the central claim. In the revised version we will add a dedicated ablation (new table in Section 4.3) that completely disables both BCM and DAC at inference time and reports benchmark numbers for the base MLLM alone. This will directly test whether the performance improvements persist through latent-state probing without any auxiliary modules or residual training signals. We will also revise the abstract to cite these results, making the internalization argument explicit rather than implicit. Revision: yes.
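The promised ablation is straightforward to operationalize. Here is a minimal sketch of the comparison the referee asks for, with a hypothetical `evaluate(checkpoint, benchmark, modules_active)` scorer and placeholder benchmark names; none of these identifiers come from the paper.

```python
# Hypothetical ablation harness: compare the base model, the distilled model
# with auxiliary modules disabled, and the distilled model with them enabled.
BENCHMARKS = [f"benchmark_{i}" for i in range(6)]  # six perception-heavy sets

def run_ablation(evaluate, base_ckpt, distilled_ckpt):
    rows = {
        "base (no distillation)":
            {b: evaluate(base_ckpt, b, modules_active=False) for b in BENCHMARKS},
        "distilled, BCM/DAC disabled":
            {b: evaluate(distilled_ckpt, b, modules_active=False) for b in BENCHMARKS},
        "distilled, BCM/DAC enabled":
            {b: evaluate(distilled_ckpt, b, modules_active=True) for b in BENCHMARKS},
    }
    # Internalization holds if row 2 beats row 1 and roughly matches row 3:
    # the gains then survive complete removal of the auxiliary modules.
    return rows
```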
Circularity Check
No significant circularity detected; the framework is a self-contained training procedure.
Full rationale
The paper introduces V-Reflection via a two-stage distillation (BCM teacher establishing pixel-to-latent targets, DAC student mapping hidden states to probes) that is presented as an independent training recipe whose outputs are then evaluated on external benchmarks. No equations appear in the provided text that would reduce the claimed 'think-then-look' interrogation or the post-distillation internalization to a fitted parameter or self-referential definition. The central claim that latent states function as autonomous probes at inference is supported by the described procedure and visualizations rather than by any self-citation chain or ansatz smuggled from prior work. Because the derivation does not collapse to its own inputs by construction and relies on observable benchmark gains, the analysis finds no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Distillation from an explicit spatial teacher module can transfer localization ability to a latent-state student module that operates without the teacher at inference.
Invented entities (2)
- Box-Guided Compression Module (BCM): no independent evidence
- Dynamic Autoregressive Compression (DAC) module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (8-tick period emergence)
  Tag: echoes. This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "The latent reasoning steps S are set to 8 during both training and inference... latent states function as dynamic probes that actively interrogate the visual feature space"
- IndisputableMonolith/Cost/FunctionalEquation.lean · J(x) = ½(x + x⁻¹) − 1 uniqueness and recognition cost forcing
  Tag: echoes. This paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "DAC projects hidden states H into dynamic probes Q_dyn that interrogate the global feature map F_global through cross-attention"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [3] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [5] Jeffrey Cheng and Benjamin Van Durme. Compressed chain of thought: Efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171, 2024.
- [6] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- [7] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.
- [8] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.
- [9] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
- [10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [11] Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, and Lei Zhang. Rex-Thinker: Grounded object referring via chain-of-thought reasoning. arXiv preprint arXiv:2506.04034, 2025.
- [12] Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. arXiv preprint arXiv:2509.24251, 2025.
- [13] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [14] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025.
- [15] Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, and Lijuan Wang. Point-RFT: Improving multimodal reasoning with visually grounded reinforcement finetuning. arXiv preprint arXiv:2505.19702, 2025.
- [16] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025.
- [17] Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching VLMs to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418, 2025.
- [18] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.
- [19] Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 677–693, 2025.
- [20] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. OpenThinkIMG: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025.
- [21] Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning. arXiv e-prints, pages arXiv–2503, 2025.
- [22] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
- [23] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
- [24] Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language. arXiv preprint arXiv:2511.21395, 2025.
- [25] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
- [26] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. arXiv preprint, 2024.
- [27] Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, et al. Perception-aware policy optimization for multimodal reasoning. arXiv preprint arXiv:2507.06448, 2025.
- [28] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [29] Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. VTool-R1: VLMs learn to think with images via reinforcement learning on multimodal tool use. arXiv preprint arXiv:2505.19255, 2025.
- [30] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.
- [31] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025.
- [32] Yunqiu Xu, Linchao Zhu, and Yi Yang. MC-Bench: A benchmark for multi-context visual grounding in the era of MLLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17675–17687, 2025.
- [33] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-OneVision: Advancing generalized multimodal reasoning through cross-modal formalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2376–2385, 2025.
- [34] Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218, 2025.
- [35] En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-R1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954, 2025.
- [36] Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv e-prints, pages arXiv–2505, 2025.
- [37] Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025.
- [38] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024.
- [39] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
- [40] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.