pith. machine review for the scientific record. sign in

arxiv: 2605.08802 · v2 · submitted 2026-05-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords latent visual reasoningcontrastive optimizationmultimodal large language modelsexploratory reasoningreinforcement learning post-trainingangle-based perturbationtrajectory contrastive reward
0
0 comments X

The pith

CoLVR replaces hard alignment losses with contrastive objectives to let latent states explore a wider semantic space during visual reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing latent visual reasoning methods force hidden states to match fixed visual features, which restricts how freely the model can explore reasoning paths. CoLVR introduces a contrastive training stage that perturbs angles in the latent space to produce more diverse representations, then adds a trajectory-level contrastive reward during reinforcement learning to refine those paths. The result is measurable gains on benchmarks that test exploratory reasoning while preserving or improving out-of-domain performance. The method treats the continuous hidden state as the primary carrier of reasoning rather than converting it to discrete tokens at every step.

Core claim

CoLVR learns exploratory latent representations by optimizing a contrastive objective that uses angle-based perturbations to expand the semantic latent space, then applies a latent trajectory contrastive reward in RL post-training to encourage diverse reasoning behaviors without forcing matches to predefined visual features.

What carries the argument

Latent contrastive objective with angle-based perturbation plus latent trajectory contrastive reward for RL post-training.

If this is right

  • Latent representations become measurably more diverse, producing 5.83 percent average gains on VSP and 8.00 percent on Jigsaw.
  • Out-of-domain generalization improves, shown by a 3.40 percent gain on MMStar.
  • Reasoning can stay in continuous latent space longer before any token decoding is required.
  • RL post-training can be guided directly by trajectory-level contrastive signals rather than outcome-only rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive machinery could be tested on other latent-sequence tasks such as long-horizon planning or audio reasoning.
  • If angle perturbation proves robust, future models might drop explicit visual-feature alignment stages entirely.
  • Diverse latent trajectories may reduce the need for large numbers of sampled reasoning chains at inference time.

Load-bearing premise

Angle-based perturbations and trajectory contrastive rewards will reliably enlarge the useful semantic space and produce diverse behaviors without adding new biases or hurting performance on ordinary tasks.

What would settle it

Training a multimodal model with CoLVR yields no gain or a drop on VSP and Jigsaw relative to the hard-alignment baseline, or produces lower scores on standard visual question answering tasks.

Figures

Figures reproduced from arXiv: 2605.08802 by Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu, Zhen Zhao, Ziyang Ding.

Figure 1
Figure 1. Figure 1: (a) Comparison of Latent Token Exploratory Capacity Under Hard Alignment and Contrastive Optimization. Hard alignment enforces rigid feature matching and leads to fixed reasoning paths, while contrastive optimization encourages more flexible latent trajectories. (b) UMAP-based 2D visualization of the first four latent tokens from Mirage and CoLVR, where hidden states are projected into a two-dimensional sp… view at source ↗
Figure 2
Figure 2. Figure 2: The framework of CoLVR. After a warm-up stage, CoLVR perform latent contrastive learning by constructing positive and negative samples via angular perturbations to encourage exploratory latent token representations. Additionally, we introduce a latent trajectory-based reward within the GRPO process to further optimize and sustain the exploration capability of latent tokens. We then normalize both the traje… view at source ↗
Figure 3
Figure 3. Figure 3: (a) UMAP Visualization of Latent Tokens in Jigsaw task, Mirage and CoLVR exhibit the same trend as observed on VSP. (b) Inference Noise Injection. We introduce random noise ϵ ∼ N (0, σ2 ) into the latent tokens during the inference phase to evaluate their sensitivity and explorative capabilities. to 60.33%, with Tertis accuracy dropping from 42.67% to 38.67%. This suggests that traditional GRPO cannot effe… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of attention maps across VSP levels for Mirage and CoLVR. tokens learn more exploratory and discriminative features, yielding attention maps that distribute broadly across the global context. Consequently, CoLVR captures richer structural relationships and exhibits a markedly greater diversity in visual reasoning. A.3 Visualization of UMAP for a Single Case with Latent Tokens Rolled Out K Times … view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of latent token dispersion across models, measured by average distance to cluster centroids over 40 cases with 20 samples each. Additionally, we provide illustrative examples from a range of out-of-domain benchmark tasks, including VisPuzzle, V*, MMVP, MMStar, and CV-Bench. These examples are intended to highlight the diversity and complexity of evaluation scenarios beyond the training distribut… view at source ↗
Figure 6
Figure 6. Figure 6: Inference example: Jigsaw. input image image with hints Tertis Task Fill the exact yellow shape shown in the question grid. Choose the only option set whose pieces perfectly tile the shape without gaps or overlap. <image> Give only the final answer, wrapped in \\boxed{} \boxed{B} [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Inference example: Tertis. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 7
Figure 7. Figure 7: Inference example: Tertis. input image image with hints VSP Task As a professional maze solver, your task is to analyze a grid-based map and devise an action plan that enables a player to reach the goal from the starting point without falling into any holes, using the fewest possible moves. ## Game Setup - The game presents a fully observable grid-based map. - The player starts at a specified grid square, … view at source ↗
Figure 8
Figure 8. Figure 8: Inference example: VSP. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Inference examples: VisPuzzle, V*, MMVP, MMStar, and CV-Bench. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
read the original abstract

Due to the potential for exploratory reasoning of Latent Visual Reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory of latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain a more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. Firstly, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embedding. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of latent visual reasoning process and thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out of domain benchmarks, with a 3.40% gain on MMStar. The data, codes, and models are released at https://github.com/Oscar-dzy/CoLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes CoLVR, a contrastive optimization framework for latent visual reasoning in MLLMs. It replaces hard alignment objectives with a latent contrastive objective that uses angle-based perturbation to expand the semantic space, followed by a trajectory contrastive reward in RL post-training to promote diverse reasoning trajectories. Experiments report average gains of 5.83% on VSP, 8.00% on Jigsaw, and 3.40% on out-of-domain MMStar relative to prior latent models, with public code and model release.

Significance. If the reported gains prove robust, the approach offers a concrete alternative to over-constrained latent embeddings, potentially improving exploratory capacity in multimodal reasoning without sacrificing standard-task performance. The explicit contrastive objectives and public code release are strengths that support reproducibility and further testing of the exploration hypothesis.

major comments (3)
  1. [§4.2] §4.2 (latent contrastive objective): the angle-based perturbation is described as expanding the semantic space, but the manuscript does not specify whether the perturbation radius or sampling distribution is held constant across datasets or tuned per model; this leaves open whether the reported VSP/Jigsaw gains are attributable to the contrastive term or to implicit hyperparameter search.
  2. [Table 3] Table 3 (ablation on RL post-training): removing the trajectory contrastive reward drops performance by only 1.2–1.8 points on two of the three benchmarks; this modest delta weakens the central claim that the RL stage is required to foster diverse reasoning behaviors.
  3. [§5.3] §5.3 (out-of-domain evaluation): the 3.40% MMStar gain is presented without error bars, without a non-latent baseline, and without a statistical significance test; the improvement cannot yet be distinguished from run-to-run variance.
minor comments (3)
  1. [Abstract] Abstract, line 3: 'exploratory of latent reasoning process' is grammatically incomplete; rephrase to 'exploratory nature of the latent reasoning process'.
  2. [§3.1] §3.1: the contrastive loss notation mixes cosine similarity with an unspecified temperature parameter; align the symbols with standard InfoNCE notation for clarity.
  3. [Figure 2] Figure 2: the trajectory visualization lacks axis labels and a legend distinguishing positive/negative pairs; this reduces interpretability of the claimed diversity gain.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment and constructive feedback on our manuscript. We address each of the major comments below and have revised the manuscript accordingly to improve clarity and robustness.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (latent contrastive objective): the angle-based perturbation is described as expanding the semantic space, but the manuscript does not specify whether the perturbation radius or sampling distribution is held constant across datasets or tuned per model; this leaves open whether the reported VSP/Jigsaw gains are attributable to the contrastive term or to implicit hyperparameter search.

    Authors: We will revise the manuscript to specify that the angle-based perturbation radius and the sampling distribution were held constant across datasets and models. The values were chosen based on preliminary experiments on a held-out validation set and kept fixed for all main experiments to allow fair comparisons. Supporting ablations in Table 2 show that the contrastive objective contributes significantly to the performance gains. revision: yes

  2. Referee: [Table 3] Table 3 (ablation on RL post-training): removing the trajectory contrastive reward drops performance by only 1.2–1.8 points on two of the three benchmarks; this modest delta weakens the central claim that the RL stage is required to foster diverse reasoning behaviors.

    Authors: The observed drop of 1.2-1.8 points indicates a meaningful contribution from the trajectory contrastive reward, particularly when considering the cumulative effect with the latent contrastive training. We will include additional discussion and metrics on reasoning diversity in the revised manuscript to reinforce the importance of the RL stage for fostering exploratory behaviors. revision: partial

  3. Referee: [§5.3] §5.3 (out-of-domain evaluation): the 3.40% MMStar gain is presented without error bars, without a non-latent baseline, and without a statistical significance test; the improvement cannot yet be distinguished from run-to-run variance.

    Authors: We agree that error bars, a non-latent baseline, and statistical testing would strengthen the out-of-domain results. We will update §5.3 to include standard deviations from multiple runs, comparisons to non-latent models, and the results of a statistical significance test. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The CoLVR framework defines its latent contrastive objective (angle-based perturbation) and trajectory contrastive reward explicitly as new training components, then measures resulting exploratory gains on independent external benchmarks (VSP, Jigsaw, MMStar). No equations reduce the reported improvements to fitted parameters or self-referential quantities inside the same loop; the central claims rest on standard contrastive optimization applied to latent states rather than any self-definition, self-citation load-bearing step, or renamed known result. The derivation is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from contrastive learning and reinforcement learning; no new free parameters, axioms, or invented entities are explicitly introduced beyond typical hyperparameters such as perturbation angles and reward scaling.

axioms (2)
  • domain assumption Contrastive objectives with angular perturbations expand semantic coverage without over-constraining embeddings
    Invoked to justify the first training stage as enabling exploratory representations.
  • domain assumption Trajectory-level contrastive rewards in RL produce more diverse reasoning behaviors than standard objectives
    Invoked to justify the post-training stage.

pith-pipeline@v0.9.0 · 5555 in / 1410 out tokens · 52923 ms · 2026-05-13T07:04:20.179144+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  2. [2]

    Are we on the right way for evaluating large vision-language models?, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024

  3. [3]

    Cogflow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving, 2026

    Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, and Hangjie Yuan. Cogflow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving, 2026

  4. [4]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  5. [5]

    Muco: Multi-turn contrastive learning for multimodal embedding model.arXiv preprint arXiv:2602.06393, 2026

    Geonmo Gu, Byeongho Heo, Jaemyung Yu, Jaehui Hwang, Taekyung Kim, Sangmin Lee, HeeJae Jun, Yoohoon Kang, Sangdoo Yun, and Dongyoon Han. Muco: Multi-turn contrastive learning for multimodal embedding model.arXiv preprint arXiv:2602.06393, 2026

  6. [6]

    Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. InThe F ourteenth International Conference on Learning Representations, 2026

  7. [7]

    Training large language models to reason in a continuous latent space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. InSecond Conference on Language Modeling

  8. [8]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  9. [9]

    Diffthinker: Towards generative multimodal reasoning with diffusion models, 2025

    Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, and Yu Cheng. Diffthinker: Towards generative multimodal reasoning with diffusion models, 2025

  10. [10]

    Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

    Shanghai AI Laboratory InternVL Team. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

  11. [11]

    Hallucination augmented contrastive learning for multimodal large language model

    Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, and Shikun Zhang. Hallucination augmented contrastive learning for multimodal large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27036–27046, 2024

  12. [12]

    Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum

    Ang Li, Charles L. Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-cot: A dataset for interleaved vision-language reasoning. InThe F ourteenth International Conference on Learning Representations, 2026. 10

  13. [13]

    Latent visual reasoning

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. InThe F ourteenth International Conference on Learning Representations, 2026

  14. [14]

    Cocova: Chain of continuous vision-language thought for latent space reasoning.arXiv e-prints, pages arXiv–2511, 2025

    Jizheng Ma, Xiaofei Zhou, Yanlong Song, and Han Yan. Cocova: Chain of continuous vision-language thought for latent space reasoning.arXiv e-prints, pages arXiv–2511, 2025

  15. [15]

    Umap: Uniform manifold approximation and projection for dimension reduction, 2020

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2020

  16. [16]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024

  17. [17]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024

  18. [18]

    Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.arXiv preprint arXiv:2511.19418, 2025

    Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens.arXiv preprint arXiv:2511.19418, 2025

  19. [19]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  20. [20]

    Enhancing discriminative ability in multimodal llms: A contrastive learning approach for ct report generation.Information Fusion, 123:103240, 2025

    Qingyong Su, Chong Feng, Ge Shi, Bo Wang, and Yan Zhuang. Enhancing discriminative ability in multimodal llms: A contrastive learning approach for ct report generation.Information Fusion, 123:103240, 2025

  21. [21]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

    Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

  22. [22]

    Kimi-vl technical report, 2025

    Kimi Team. Kimi-vl technical report, 2025

  23. [23]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024

  24. [24]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024

  25. [25]

    Monet: Reasoning in latent visual space beyond images and language, 2025

    Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language, 2025

  26. [26]

    Forest before trees: Latent superposition for efficient visual reasoning, 2026

    Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, and Yuhan Liu. Forest before trees: Latent superposition for efficient visual reasoning, 2026

  27. [27]

    V*: Guided visual search as a core mechanism in multimodal llms, 2023

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms, 2023

  28. [28]

    Vi- sual planning: Let’s think only with images

    Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli ´c. Vi- sual planning: Let’s think only with images. InThe F ourteenth International Conference on Learning Representations, 2026

  29. [29]

    Machine mental imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens.arXiv preprint arXiv:2506.17218, 2025

  30. [30]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p...

  31. [31]

    1\" and the right part is labeled \

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and XingYu. Deepeyes: Incentivizing ”thinking with images” via reinforcement learning. InThe F ourteenth Interna- tional Conference on Learning Representations, 2026. 11 A Appendix A.1 Detailed Experimental Settings All training procedures for CoLVR are conducted using the...