arxiv: 2606.31986 · v1 · pith:PVOEXWVCnew · submitted 2026-06-30 · 💻 cs.CV

CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts

Pith reviewed 2026-07-01 05:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords chain of latent thoughts · multi-modal large language models · visual reasoning · latent reasoning · chain-of-thought · inference efficiency · multi-modal models

The pith

Multi-modal models can reason with chains of just three latent thoughts instead of thousands of text tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text chain-of-thought forces multi-modal models to output long explicit reasoning sequences that slow inference and limit what can be expressed. CoLT instead trains the model to produce short chains of latent representations that stand in for those steps. A lightweight external decoder supplies the needed guidance during training by turning each latent state into the next text reasoning step and by aligning its own states backward to the model's latent states given earlier text, while internal losses keep successive latent states coherent. The decoder and extra losses are discarded at inference, leaving only the fast latent chain. Experiments across eight visual reasoning benchmarks show gains over both text CoT and earlier latent methods together with large reductions in inference and decoding time.

Core claim

CoLT teaches multi-modal large language models to carry out chain-of-thought reasoning inside latent space with chains as short as three steps. Supervision comes from a lightweight external decoder that operates in two modes: forward decoding of each latent thought into the textual reasoning for the next step, and backward alignment of decoder hidden states to the model's latent thoughts given preceding text. Additional internal supervision encourages coherent step-by-step transitions among the latent states. Both the decoder and the internal losses are removed after training so that inference runs entirely on the latent chain.
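
To make the inference path concrete, the toy sketch below rolls a latent chain forward for three steps and then reads out an answer, with no text decoded in between. Everything in it is an illustrative assumption rather than the paper's architecture: `latent_step`, `answer_head`, the tanh update, and the small weight matrices are hypothetical stand-ins.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def latent_step(W, h):
    """Hypothetical latent transition: next thought = tanh(W @ h)."""
    return [math.tanh(z) for z in matvec(W, h)]

def answer_head(V, h):
    """Hypothetical readout: index of the largest logit."""
    logits = matvec(V, h)
    return max(range(len(logits)), key=lambda i: logits[i])

def latent_rollout(W, V, h0, num_steps=3):
    """Run num_steps latent thoughts, then answer; no text tokens are produced."""
    chain = []
    h = h0
    for _ in range(num_steps):
        h = latent_step(W, h)
        chain.append(h)
    return chain, answer_head(V, h)

# Toy weights and initial state, made up for illustration only.
W = [[0.5, -0.2], [0.1, 0.8]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
chain, answer = latent_rollout(W, V, h0=[1.0, 0.5])
print(len(chain), answer)
```

The point of the sketch is the control flow: the cost of reasoning scales with the number of latent steps (three here), not with the length of a decoded text chain.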

What carries the argument

Lightweight external decoder providing forward decoding and backward alignment supervision, combined with internal coherence losses on latent transitions.
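
A rough sketch of how the three supervision signals might be combined into one training objective follows. The loss forms are assumptions chosen for illustration, since this review does not state them: negative log-likelihood stands in for forward decoding, mean squared error for backward alignment, and one minus cosine similarity for the internal coherence term. The function names and the weights `w_fwd`, `w_bwd`, `w_coh` are hypothetical.

```python
import math

def nll(token_probs):
    """Forward-mode stand-in: mean negative log-likelihood the decoder
    assigns to the reference tokens of the next reasoning step."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def mse(a, b):
    """Backward-mode stand-in: mean squared error between a decoder
    hidden state and the model's latent thought."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def coherence(latents):
    """Internal stand-in: mean (1 - cosine) over consecutive latent thoughts."""
    pairs = list(zip(latents, latents[1:]))
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)

def total_loss(answer_loss, fwd_probs, dec_states, latents,
               w_fwd=1.0, w_bwd=1.0, w_coh=1.0):
    """Answer loss plus the three auxiliary terms (assumed weighted sum)."""
    fwd = sum(nll(p) for p in fwd_probs) / len(fwd_probs)
    bwd = sum(mse(d, h) for d, h in zip(dec_states, latents)) / len(latents)
    return answer_loss + w_fwd * fwd + w_bwd * bwd + w_coh * coherence(latents)

# Toy inputs, made up for illustration: three latent thoughts of width 2.
latents = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
dec_states = [[0.9, 0.1], [0.7, 0.7], [0.1, 0.9]]
fwd_probs = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.9]]
print(total_loss(0.7, fwd_probs, dec_states, latents))
```

Under this reading, only `answer_loss` depends on components that survive to inference; the other three terms exist purely to shape the latent states during training.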

If this is right

  • CoLT exceeds both text-based CoT and prior latent reasoning methods such as CODI and SIM-CoT on eight visual reasoning benchmarks.
  • CoLT also exceeds latent visual reasoning methods that require auxiliary images and costly annotations.
  • Inference time is reduced by a factor of 10.1 relative to text CoT.
  • Text decoding time is reduced by a factor of 22.6 relative to text CoT.
  • Effective reasoning occurs with latent chains of only three steps.
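
The speed-up factors above can be restated as the share of time saved, which is often easier to compare. The short calculation below does only that arithmetic; the helper names are hypothetical, and the 20.2-second baseline is a made-up figure, not a measurement from the paper.

```python
def time_after_speedup(baseline_seconds, factor):
    """Time remaining after a speed-up by the given factor."""
    return baseline_seconds / factor

def percent_reduction(factor):
    """Percentage of the baseline time saved by a speed-up factor."""
    return 100.0 * (1.0 - 1.0 / factor)

# Factors reported in the abstract, relative to text CoT.
for name, factor in [("inference", 10.1), ("text decoding", 22.6)]:
    print(f"{name}: {factor}x faster = {percent_reduction(factor):.1f}% less time")

# Hypothetical baseline of 20.2 s per sample, for illustration only.
print(time_after_speedup(20.2, 10.1))
```

A 10.1× speed-up corresponds to roughly 90% less inference time, and a 22.6× speed-up to roughly 96% less text decoding time. The two factors measure different quantities, so they should not be multiplied or added together.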

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Once the decoder is removed, the learned latent transitions may allow the model to handle longer or more intricate visual tasks that would otherwise require prohibitive numbers of text tokens.
  • The same supervision pattern could be tested on non-visual modalities to see whether latent chains transfer.
  • The fact that external supervision can be dropped at inference suggests the model internalizes a stable reasoning structure that may generalize to new tasks without retraining the decoder.

Load-bearing premise

The external decoder's bidirectional supervision together with internal coherence losses will reliably prevent meaningless latent states and training collapse when the model is forced to reason without producing text.

What would settle it

Remove the decoder and internal losses, train the model to generate latent thoughts on the same benchmarks, and check whether accuracy falls sharply or training becomes unstable.
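
One way to turn that test into a decision rule is sketched below: compare per-seed accuracies of the full and ablated runs, and call the drop sharp only when the mean gap exceeds both a minimum size and the seed-to-seed noise. The threshold `min_gap`, the factor of two on the spread, and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

def sharp_drop(full, ablated, min_gap=2.0):
    """True when the ablated run falls behind the full run by more than
    min_gap points and by more than twice the larger seed spread."""
    full_mean, full_std = summarize(full)
    abl_mean, abl_std = summarize(ablated)
    gap = full_mean - abl_mean
    noise = 2.0 * max(full_std, abl_std)
    return gap > max(min_gap, noise)

# Placeholder accuracies (percent) over three seeds; NOT reported results.
full_runs = [60.0, 61.0, 59.0]
ablated_runs = [50.0, 52.0, 48.0]
print(summarize(full_runs), summarize(ablated_runs))
print(sharp_drop(full_runs, ablated_runs))
```

A larger spread across seeds in the ablated run would itself be one symptom of the training instability the premise is meant to rule out.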

Figures

Figures reproduced from arXiv: 2606.31986 by Liang Wan, Lianyu Hu, Qing Guo, Shengqian Qin, Wei Feng, Yang Liu, Zeqin Liao.

Figure 1: Comparison of reasoning paradigms and overview of CoLT. Left: (1) text CoT generates verbose reasoning tokens before the answer; (2) latent reasoning replaces text with latent states but lacks explicit supervision; (3) latent visual reasoning aligns latent states with extra image annotations via an encoder. Right: CoLT generates latent thought vectors h_1, ..., h_K regulated by two complementary mechanism…

Figure 2: Two qualitative examples from MathVista with decoded latent thoughts. We use the forward decoder to project latent states back into text. Color-coded segments indicate reasoning content decoded from different latent steps.
Original abstract

Chain-of-thought (CoT) reasoning has enabled multi-modal large language models (MLLMs) to tackle complex visual reasoning tasks by generating explicit intermediate reasoning steps in natural language. However, this text-based reasoning paradigm is inherently slow at inference time, requiring up to thousands of tokens, and is fundamentally constrained by the expressiveness of natural language. In this paper, we propose CoLT (Chain of Latent Thoughts), a novel framework that teaches multi-modal models to reason through a chain of latent thought representations instead of verbose text tokens, and that can think in as few as 3 steps. Naively forcing the model to think with latent states easily produces meaningless semantics and makes training unstable. To effectively regulate the latent reasoning process, we introduce a lightweight external decoder that provides step-level supervision for each latent reasoning step in two complementary directions: a forward mode that decodes latent thoughts into the textual reasoning of the next step, and a backward mode that aligns decoder hidden states with the model's latent thoughts given preceding textual context. We further incorporate internal supervision that encourages coherent step-by-step latent transitions. The decoder and internal supervision are removed during inference to maintain the high efficiency of latent reasoning. Extensive experiments on eight benchmarks demonstrate that CoLT not only outperforms existing latent reasoning methods such as CODI and SIM-CoT, but also surpasses latent visual reasoning approaches that rely on auxiliary images with costly annotation requirements. Compared to text CoT methods, CoLT reduces inference time by 10.1$\times$ and text decoding time by 22.6$\times$. Code is released at https://github.com/hulianyuyy/CoLT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CoLT, a framework that enables multi-modal LLMs to perform chain-of-thought reasoning via short chains (as few as 3 steps) of latent representations instead of explicit text tokens. A lightweight external decoder supplies step-level supervision in forward mode (decoding latents to next-step text) and backward mode (aligning decoder hidden states to latents given prior text), supplemented by an internal coherence loss on consecutive latent states; both decoder and losses are removed at inference. Experiments on eight benchmarks claim outperformance over CODI, SIM-CoT, and auxiliary-image latent methods, plus 10.1× inference-time and 22.6× text-decoding-time reductions versus text CoT baselines.

Significance. If the central claim holds—that the learned latent chains encode usable reasoning content that survives removal of the external decoder—this would represent a meaningful advance in efficient multi-modal reasoning by sidestepping the token overhead and expressivity limits of text CoT. The public code release is a positive factor for reproducibility.

major comments (3)
  1. [Method section] Method section (description of forward/backward decoder and coherence loss): no ablation or controlled experiment is described that isolates the latent states from decoder gradients (e.g., by freezing or removing the decoder during training and measuring downstream performance). This is load-bearing for the claim that gains derive from genuine latent reasoning rather than decoder-dependent artifacts.
  2. [Experiments] Experiments / results: the manuscript asserts stable training and performance gains but supplies no quantitative metrics on training stability (loss curves, variance across seeds), decoder-mode ablations, or error analysis of the 3-step latent chains. Without these, it is impossible to verify that the external supervision reliably prevents meaningless latent semantics.
  3. [Results] Results section: no probing, nearest-neighbor analysis, or intervention study is reported that directly tests whether the latent representations remain semantically functional once the decoder is stripped at inference. This evidence gap directly affects the strongest claim that CoLT outperforms prior latent methods because of latent-space reasoning.
minor comments (2)
  1. [Abstract] Abstract: the reported speed-up factors (10.1×, 22.6×) would be more informative if accompanied by the exact baseline configurations and hardware used.
  2. [Method section] Notation: the distinction between “latent thought representations” and the model’s internal hidden states could be clarified with an explicit diagram or equation in the method section.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional evidence would strengthen the claims regarding latent-space reasoning. We address each major comment below and indicate the revisions we will make to the manuscript.

  1. Referee: [Method section] Method section (description of forward/backward decoder and coherence loss): no ablation or controlled experiment is described that isolates the latent states from decoder gradients (e.g., by freezing or removing the decoder during training and measuring downstream performance). This is load-bearing for the claim that gains derive from genuine latent reasoning rather than decoder-dependent artifacts.

    Authors: We agree that an ablation isolating the contribution of decoder gradients is necessary to support the claim of genuine latent reasoning. In the revised manuscript we will add a controlled experiment in which the external decoder is either removed entirely or its parameters are frozen after an initial warm-up phase, with downstream benchmark performance reported for comparison against the full CoLT training procedure. This will quantify the extent to which the learned latents depend on ongoing decoder supervision versus functioning independently. revision: yes

  2. Referee: [Experiments] Experiments / results: the manuscript asserts stable training and performance gains but supplies no quantitative metrics on training stability (loss curves, variance across seeds), decoder-mode ablations, or error analysis of the 3-step latent chains. Without these, it is impossible to verify that the external supervision reliably prevents meaningless latent semantics.

    Authors: We acknowledge the absence of these quantitative diagnostics. The revised manuscript will include (i) training loss curves for both forward and backward modes, (ii) performance variance across at least three random seeds, (iii) an ablation comparing forward-only, backward-only, and combined decoder supervision, and (iv) a qualitative error analysis of the 3-step latent chains on representative failure cases from the benchmarks. revision: yes

  3. Referee: [Results] Results section: no probing, nearest-neighbor analysis, or intervention study is reported that directly tests whether the latent representations remain semantically functional once the decoder is stripped at inference. This evidence gap directly affects the strongest claim that CoLT outperforms prior latent methods because of latent-space reasoning.

    Authors: The primary support for post-decoder semantic functionality is the consistent outperformance over CODI, SIM-CoT, and auxiliary-image baselines on eight benchmarks after decoder removal. Nevertheless, we agree that direct tests would be more conclusive. In revision we will add a nearest-neighbor analysis of the learned latent states against text embeddings of reasoning steps, together with a simple intervention study that perturbs individual latent vectors and measures the effect on final answer accuracy. revision: yes
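
The two diagnostics promised in the third response can be sketched on toy data. The first function finds which reasoning-step embedding a latent state sits closest to by cosine similarity; the second swaps one latent for a replacement vector and reports whether the final answer flips. The names `nearest_step`, `answer_changes`, and `toy_predict`, along with all vectors, are hypothetical and stand in for a real model and real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_step(latent, step_embeddings):
    """Index of the reasoning-step embedding most similar to the latent."""
    sims = [cosine(latent, e) for e in step_embeddings]
    return max(range(len(sims)), key=lambda i: sims[i])

def answer_changes(predict, latents, index, replacement):
    """Swap one latent for `replacement` and report whether the answer flips."""
    perturbed = list(latents)
    perturbed[index] = replacement
    return predict(latents) != predict(perturbed)

def toy_predict(latents):
    """Hypothetical readout: argmax of the element-wise sum of all latents."""
    total = [sum(col) for col in zip(*latents)]
    return max(range(len(total)), key=lambda i: total[i])

# Toy latent chain and text-step embeddings, made up for illustration.
latents = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
step_embeddings = [[1.0, 0.0], [0.0, 1.0]]
print([nearest_step(h, step_embeddings) for h in latents])
print(answer_changes(toy_predict, latents, 0, [0.0, 0.0]))
```

If latent states carried no usable content, one would expect nearest-neighbor matches close to chance and answers that rarely change under such swaps; a real study would need many samples and a control perturbation to separate signal from noise.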

Circularity Check

0 steps flagged

No significant circularity; method is an independent training procedure evaluated on external benchmarks.


The paper describes a training procedure that uses an external lightweight decoder (forward/backward modes) plus internal coherence losses to supervise latent-state transitions, then removes the decoder at inference. No equations, fitted parameters, or self-citations are presented that would make the reported gains or latent-reasoning claims equivalent to the inputs by construction. Performance is measured on eight external benchmarks against baselines (CODI, SIM-CoT, text CoT), satisfying the criteria for a self-contained, non-circular derivation. The skeptic concern about decoder-dependent artifacts is a correctness/empirical-validity issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the standard assumption that an MLLM can be fine-tuned with auxiliary decoder losses. The latent-thought representation itself is treated as a modeling choice rather than a new physical entity.

pith-pipeline@v0.9.1-grok · 5842 in / 1239 out tokens · 26157 ms · 2026-07-01T05:39:16.075717+00:00 · methodology

