Recognition: 2 Lean theorem links
CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization
Pith reviewed 2026-05-13 07:04 UTC · model grok-4.3
The pith
CoLVR replaces hard alignment losses with contrastive objectives to let latent states explore a wider semantic space during visual reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoLVR learns exploratory latent representations by optimizing a contrastive objective that uses angle-based perturbations to expand the semantic latent space, then applies a latent trajectory contrastive reward in RL post-training to encourage diverse reasoning behaviors without forcing matches to predefined visual features.
What carries the argument
Latent contrastive objective with angle-based perturbation plus latent trajectory contrastive reward for RL post-training.
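To make this machinery concrete, here is a minimal sketch of an angle-based perturbation feeding an InfoNCE-style latent contrastive loss, written under our own assumptions: the function names, the maximum angle, the norm-preserving rotation, and batch-level negatives are illustrative choices, not the paper's exact formulation.

```python
# Sketch: angle-based perturbation + latent contrastive loss.
# Assumptions (ours, not the paper's): perturbations are norm-preserving
# rotations by a random angle <= max_angle, and negatives come from the batch.
import torch
import torch.nn.functional as F

def angle_perturb(z: torch.Tensor, max_angle: float = 0.3) -> torch.Tensor:
    """Rotate each latent in z (B, D) by a random angle toward a random
    direction orthogonal to it, preserving the vector norm."""
    z_unit = F.normalize(z, dim=-1)
    noise = torch.randn_like(z)
    # Remove the component of the noise along z to get the rotation plane.
    ortho = noise - (noise * z_unit).sum(-1, keepdim=True) * z_unit
    ortho = F.normalize(ortho, dim=-1)
    theta = torch.rand(z.shape[:-1], device=z.device).unsqueeze(-1) * max_angle
    rotated = torch.cos(theta) * z_unit + torch.sin(theta) * ortho
    return rotated * z.norm(dim=-1, keepdim=True)

def latent_contrastive_loss(z: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: each latent's positive is its own angular
    perturbation; every other latent in the batch serves as a negative."""
    anchors = F.normalize(z, dim=-1)                   # (B, D)
    positives = F.normalize(angle_perturb(z), dim=-1)  # (B, D)
    logits = anchors @ positives.T / tau               # (B, B) scaled cosines
    labels = torch.arange(z.shape[0], device=z.device)
    return F.cross_entropy(logits, labels)
```

Because the rotation preserves norms, the perturbation moves a latent only in direction, which is one way an angular scheme could widen semantic coverage without destabilizing magnitudes.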
If this is right
- Latent representations become measurably more diverse, producing average gains of 5.83% on VSP and 8.00% on Jigsaw.
- Out-of-domain generalization improves, shown by a 3.40% gain on MMStar.
- Reasoning can stay in continuous latent space longer before any token decoding is required.
- RL post-training can be guided directly by trajectory-level contrastive signals rather than outcome-only rewards (one possible form is sketched after this list).
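For the last point above, one plausible shape for a trajectory-level contrastive reward, sketched under our own assumptions (mean-pooled latent trajectories, an InfoNCE-style log-ratio, correct rollouts as positives and incorrect ones as negatives); the paper's actual reward may be defined differently.

```python
# Sketch: trajectory-level contrastive reward for RL post-training.
# All names and the pooling scheme are illustrative assumptions.
import torch
import torch.nn.functional as F

def trajectory_reward(traj: torch.Tensor,
                      pos_trajs: torch.Tensor,
                      neg_trajs: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """traj: (T, D) latent states of one rollout. pos_trajs / neg_trajs:
    (K, T, D) reference rollouts treated as positives / negatives.
    Returns a scalar reward: high when the trajectory sits near the
    positives and away from the negatives."""
    def pool(x: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the time axis, then L2-normalize.
        return F.normalize(x.mean(dim=-2), dim=-1)

    q = pool(traj)                         # (D,)
    pos_sim = (pool(pos_trajs) @ q) / tau  # (K,)
    neg_sim = (pool(neg_trajs) @ q) / tau  # (K,)
    # InfoNCE-style log-ratio over pooled trajectory embeddings.
    return pos_sim.logsumexp(0) - torch.cat([pos_sim, neg_sim]).logsumexp(0)
```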
Where Pith is reading between the lines
- The same contrastive machinery could be tested on other latent-sequence tasks such as long-horizon planning or audio reasoning.
- If angle perturbation proves robust, future models might drop explicit visual-feature alignment stages entirely.
- Diverse latent trajectories may reduce the need for large numbers of sampled reasoning chains at inference time.
Load-bearing premise
Angle-based perturbations and trajectory contrastive rewards will reliably enlarge the useful semantic space and produce diverse behaviors without adding new biases or hurting performance on ordinary tasks.
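One inexpensive way to probe this premise empirically is to check whether trained latent states actually occupy a wider angular region than before. A minimal sketch with a metric of our own choosing (mean pairwise angle), not one the paper specifies:

```python
# Sketch: quantify latent diversity as mean pairwise angular distance.
# The metric is our illustrative choice for testing the premise.
import torch
import torch.nn.functional as F

def mean_pairwise_angle(latents: torch.Tensor) -> torch.Tensor:
    """latents: (N, D) latent states collected from reasoning traces.
    Returns the mean pairwise angle in radians; a larger value means the
    states spread over a wider region of the semantic latent space."""
    z = F.normalize(latents, dim=-1)
    cosines = (z @ z.T).clamp(-1.0, 1.0)   # numerical safety for acos
    angles = torch.acos(cosines)
    off_diag = ~torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    return angles[off_diag].mean()
```

Comparing this statistic before and after CoLVR training, on ordinary tasks as well as VSP and Jigsaw, would show whether the enlarged space comes at the cost the premise rules out.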
What would settle it
Training a multimodal model with CoLVR yields no gain or a drop on VSP and Jigsaw relative to the hard-alignment baseline, or produces lower scores on standard visual question answering tasks.
Original abstract
Due to the potential for exploratory reasoning of Latent Visual Reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory of latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain a more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. Firstly, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embedding. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of latent visual reasoning process and thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out of domain benchmarks, with a 3.40% gain on MMStar. The data, codes, and models are released at https://github.com/Oscar-dzy/CoLVR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CoLVR, a contrastive optimization framework for latent visual reasoning in MLLMs. It replaces hard alignment objectives with a latent contrastive objective that uses angle-based perturbation to expand the semantic space, followed by a trajectory contrastive reward in RL post-training to promote diverse reasoning trajectories. Experiments report average gains of 5.83% on VSP, 8.00% on Jigsaw, and 3.40% on out-of-domain MMStar relative to prior latent models, with public code and model release.
Significance. If the reported gains prove robust, the approach offers a concrete alternative to over-constrained latent embeddings, potentially improving exploratory capacity in multimodal reasoning without sacrificing standard-task performance. The explicit contrastive objectives and public code release are strengths that support reproducibility and further testing of the exploration hypothesis.
Major comments (3)
- [§4.2] Latent contrastive objective: the angle-based perturbation is described as expanding the semantic space, but the manuscript does not specify whether the perturbation radius or sampling distribution is held constant across datasets or tuned per model; this leaves open whether the reported VSP/Jigsaw gains are attributable to the contrastive term or to implicit hyperparameter search.
- [Table 3] Ablation on RL post-training: removing the trajectory contrastive reward drops performance by only 1.2–1.8 points on two of the three benchmarks; this modest delta weakens the central claim that the RL stage is required to foster diverse reasoning behaviors.
- [§5.3] Out-of-domain evaluation: the 3.40% MMStar gain is presented without error bars, without a non-latent baseline, and without a statistical significance test; the improvement cannot yet be distinguished from run-to-run variance (a sketch of such a test follows this list).
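On the third point, separating the 3.40% gain from seed noise takes only a handful of runs. A minimal sketch of one such check, a paired t-test over per-seed scores; the numbers below are placeholders, not reported results:

```python
# Sketch: paired significance test over per-seed benchmark scores.
# The score lists are hypothetical placeholders, not the paper's data.
from scipy import stats

colvr_runs    = [61.2, 60.8, 61.5, 60.9, 61.1]  # hypothetical MMStar scores
baseline_runs = [58.0, 58.4, 57.9, 58.2, 58.1]  # hypothetical baseline scores

t_stat, p_value = stats.ttest_rel(colvr_runs, baseline_runs)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value with a consistent sign across seeds would distinguish
# the reported gain from run-to-run variance.
```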
Minor comments (3)
- [Abstract] Line 3: 'exploratory of latent reasoning process' is grammatically incomplete; rephrase to 'exploratory nature of the latent reasoning process'.
- [§3.1] The contrastive loss notation mixes cosine similarity with an unspecified temperature parameter; align the symbols with standard InfoNCE notation for clarity (written out after this list).
- [Figure 2] The trajectory visualization lacks axis labels and a legend distinguishing positive/negative pairs; this reduces interpretability of the claimed diversity gain.
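For reference, the standard InfoNCE form that the §3.1 comment suggests aligning to, with cosine similarity and an explicit temperature \(\tau\):

```latex
% Standard InfoNCE with cosine similarity and temperature \tau.
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\log
    \frac{\exp\!\bigl(\operatorname{sim}(z, z^{+})/\tau\bigr)}
         {\sum_{i=1}^{N} \exp\!\bigl(\operatorname{sim}(z, z_{i})/\tau\bigr)},
\qquad
\operatorname{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}.
```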
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive feedback on our manuscript. We address each of the major comments below and have revised the manuscript accordingly to improve clarity and robustness.
Point-by-point responses
- Referee: [§4.2] Latent contrastive objective: the angle-based perturbation is described as expanding the semantic space, but the manuscript does not specify whether the perturbation radius or sampling distribution is held constant across datasets or tuned per model; this leaves open whether the reported VSP/Jigsaw gains are attributable to the contrastive term or to implicit hyperparameter search.
  Authors: We will revise the manuscript to specify that the angle-based perturbation radius and the sampling distribution were held constant across datasets and models. The values were chosen in preliminary experiments on a held-out validation set and kept fixed for all main experiments to allow fair comparisons. Supporting ablations in Table 2 show that the contrastive objective contributes significantly to the performance gains. Revision: yes
- Referee: [Table 3] Ablation on RL post-training: removing the trajectory contrastive reward drops performance by only 1.2–1.8 points on two of the three benchmarks; this modest delta weakens the central claim that the RL stage is required to foster diverse reasoning behaviors.
  Authors: The observed drop of 1.2–1.8 points indicates a meaningful contribution from the trajectory contrastive reward, particularly when considered together with the cumulative effect of the latent contrastive training. We will include additional discussion and metrics on reasoning diversity in the revised manuscript to reinforce the importance of the RL stage for fostering exploratory behaviors. Revision: partial
- Referee: [§5.3] Out-of-domain evaluation: the 3.40% MMStar gain is presented without error bars, without a non-latent baseline, and without a statistical significance test; the improvement cannot yet be distinguished from run-to-run variance.
  Authors: We agree that error bars, a non-latent baseline, and statistical testing would strengthen the out-of-domain results. We will update §5.3 to include standard deviations from multiple runs, comparisons to non-latent models, and the results of a statistical significance test. Revision: yes
Circularity Check
No significant circularity in the derivation chain
Full rationale
The CoLVR framework defines its latent contrastive objective (angle-based perturbation) and trajectory contrastive reward explicitly as new training components, then measures the resulting exploratory gains on independent external benchmarks (VSP, Jigsaw, MMStar). No equations reduce the reported improvements to fitted parameters or self-referential quantities inside the same loop; the central claims rest on standard contrastive optimization applied to latent states rather than on any self-definition, load-bearing self-citation, or renamed known result. The derivation is therefore self-contained, and its claims are checked against external evaluation rather than against itself.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: contrastive objectives with angular perturbations expand semantic coverage without over-constraining embeddings.
- Domain assumption: trajectory-level contrastive rewards in RL produce more diverse reasoning behaviors than standard objectives.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "CoLVR introduces a latent contrastive training framework... angle-based perturbation... latent trajectory contrastive reward for RL post-training"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "hard alignment objectives... severely limiting the exploratory of latent reasoning process"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025.
- [2] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024.
- [3] Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, and Hangjie Yuan. CogFlow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving, 2026.
- [4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [5] Geonmo Gu, Byeongho Heo, Jaemyung Yu, Jaehui Hwang, Taekyung Kim, Sangmin Lee, HeeJae Jun, Yoohoon Kang, Sangdoo Yun, and Dongyoon Han. MuCo: Multi-turn contrastive learning for multimodal embedding model. arXiv preprint arXiv:2602.06393, 2026.
- [6] Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. In The Fourteenth International Conference on Learning Representations, 2026.
- [7] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. In Second Conference on Language Modeling.
- [8] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- [9] Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, and Yu Cheng. DiffThinker: Towards generative multimodal reasoning with diffusion models, 2025.
- [10] Shanghai AI Laboratory InternVL Team. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025.
- [11] Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, and Shikun Zhang. Hallucination augmented contrastive learning for multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27036–27046, 2024.
- [12] Ang Li, Charles L. Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-CoT: A dataset for interleaved vision-language reasoning. In The Fourteenth International Conference on Learning Representations, 2026.
- [13] Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. In The Fourteenth International Conference on Learning Representations, 2026.
- [14] Jizheng Ma, Xiaofei Zhou, Yanlong Song, and Han Yan. CoCoVa: Chain of continuous vision-language thought for latent space reasoning. arXiv e-prints, 2025.
- [15] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction, 2020.
- [16]
- [17]
- [18] Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching VLMs to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418, 2025.
- [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [20] Qingyong Su, Chong Feng, Ge Shi, Bo Wang, and Yan Zhuang. Enhancing discriminative ability in multimodal LLMs: A contrastive learning approach for CT report generation. Information Fusion, 123:103240, 2025.
- [21] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.
- [22]
- [23] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs, 2024.
- [24] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs, 2024.
- [25] Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language, 2025.
- [26] Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, and Yuhan Liu. Forest before trees: Latent superposition for efficient visual reasoning, 2026.
- [27] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs, 2023.
- [28] Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić. Visual planning: Let's think only with images. In The Fourteenth International Conference on Learning Representations, 2026.
- [29] Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218, 2025.
- [30] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [31] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026.
Discussion (0)