CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts
Pith reviewed 2026-07-01 05:39 UTC · model grok-4.3
The pith
Multi-modal models can reason with chains of just three latent thoughts instead of thousands of text tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoLT teaches multi-modal large language models to carry out chain-of-thought reasoning inside latent space with chains as short as three steps. Supervision comes from a lightweight external decoder that operates in two modes: forward decoding of each latent thought into the textual reasoning for the next step, and backward alignment of decoder hidden states to the model's latent thoughts given preceding text. Additional internal supervision encourages coherent step-by-step transitions among the latent states. Both the decoder and the internal losses are removed after training so that inference runs entirely on the latent chain.
What carries the argument
Lightweight external decoder providing forward decoding and backward alignment supervision, combined with internal coherence losses on latent transitions.
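To make the training signal concrete, the three supervision terms can be sketched as a weighted sum. This is a minimal illustration and not the authors' implementation: the name `colt_style_loss`, the weights, and the specific loss forms (negative log-likelihood for forward decoding, mean squared error for backward alignment, cosine distance between consecutive latents for coherence) are all assumptions, since the summary does not state them.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """One minus cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def colt_style_loss(latents, target_token_probs, decoder_hidden,
                    w_forward=1.0, w_backward=1.0, w_coherence=1.0):
    """Hypothetical combined loss over a chain of at least two latent thoughts.

    latents:            one vector per latent step
    target_token_probs: per step, the probabilities the decoder assigns to the
                        ground-truth tokens of the next reasoning step
    decoder_hidden:     per step, the decoder hidden state to align with the latent
    """
    # Forward mode (assumed form): average negative log-likelihood of next-step text.
    nll = [-math.log(p) for step in target_token_probs for p in step]
    forward = sum(nll) / len(nll)
    # Backward mode (assumed form): pull decoder hidden states toward the latents.
    backward = sum(mse(h, z) for h, z in zip(decoder_hidden, latents)) / len(latents)
    # Internal coherence (assumed form): penalize abrupt direction changes between steps.
    pairs = list(zip(latents, latents[1:]))
    coherence = sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
    return w_forward * forward + w_backward * backward + w_coherence * coherence

# A three-step chain, matching the shortest chain length the paper reports.
latents = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
decoder_hidden = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
target_token_probs = [[0.5, 0.5], [0.5], [0.5]]
print(colt_style_loss(latents, target_token_probs, decoder_hidden))
```

Setting `w_forward` or `w_backward` to zero gives forward-only and backward-only variants of the kind a decoder-mode ablation would compare. None of these terms is evaluated at inference, consistent with the decoder and internal losses being removed after training.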
If this is right
- CoLT exceeds both text-based CoT and prior latent reasoning methods such as CODI and SIM-CoT on eight visual reasoning benchmarks.
- CoLT also exceeds latent visual reasoning methods that require auxiliary images and costly annotations.
- Inference time is reduced by a factor of 10.1 relative to text CoT.
- Text decoding time is reduced by a factor of 22.6 relative to text CoT.
- Effective reasoning occurs with latent chains of only three steps.
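The two speed-up factors above can be converted into the share of baseline time that remains. The short sketch below assumes each factor is the ratio of text-CoT time to CoLT time, which is the usual reading of "10.1×" but is not spelled out in the summary; `remaining_fraction` is a name introduced here for illustration.

```python
def remaining_fraction(speedup):
    """Fraction of the baseline time that remains after an N-fold speed-up."""
    return 1.0 / speedup

for name, speedup in [("inference", 10.1), ("text decoding", 22.6)]:
    frac = remaining_fraction(speedup)
    print(f"{name}: {frac:.1%} of baseline time, {1 - frac:.1%} saved")
# inference: 9.9% of baseline time, 90.1% saved
# text decoding: 4.4% of baseline time, 95.6% saved
```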
Where Pith is reading between the lines
- Once the decoder is removed, the learned latent transitions may allow the model to handle longer or more intricate visual tasks that would otherwise require prohibitive numbers of text tokens.
- The same supervision pattern could be tested on non-visual modalities to see whether latent chains transfer.
- The fact that external supervision can be dropped at inference suggests the model internalizes a stable reasoning structure that may generalize to new tasks without retraining the decoder.
Load-bearing premise
The external decoder's bidirectional supervision together with internal coherence losses will reliably prevent meaningless latent states and training collapse when the model is forced to reason without producing text.
What would settle it
Train the model to generate latent thoughts on the same benchmarks without the external decoder and the internal losses, then check whether accuracy falls sharply or training becomes unstable.
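One way to read the outcome of such a test is to compare accuracy across several seeds for the supervised and unsupervised runs. The sketch below is only an assumed bookkeeping scheme: the helper names are hypothetical and the accuracy values are placeholders, not results from the paper.

```python
import statistics

def summarize(accuracies):
    """Mean and sample standard deviation of accuracy across seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def accuracy_drop(full_runs, ablated_runs):
    """Mean accuracy lost when the supervision is removed, in absolute terms."""
    return statistics.mean(full_runs) - statistics.mean(ablated_runs)

# Placeholder accuracies for three seeds each; NOT results from the paper.
full = [0.70, 0.72, 0.71]      # trained with decoder and internal losses
ablated = [0.55, 0.65, 0.60]   # trained without them

for name, runs in [("full", full), ("ablated", ablated)]:
    mean, std = summarize(runs)
    print(f"{name}: mean={mean:.3f} std={std:.3f}")
print(f"drop: {accuracy_drop(full, ablated):.3f}")
```

A large drop would point to the supervision being necessary for accuracy, while a much larger spread across seeds in the ablated runs would point to unstable training.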
Original abstract
Chain-of-thought (CoT) reasoning has enabled multi-modal large language models (MLLMs) to tackle complex visual reasoning tasks by generating explicit intermediate reasoning steps in natural language. However, this text-based reasoning paradigm is inherently slow at inference time, sometimes requiring thousands of tokens, and is fundamentally constrained by the expressiveness of natural language. In this paper, we propose CoLT (Chain of Latent Thoughts), a novel framework that teaches multi-modal models to reason through a chain of latent thought representations instead of verbose text tokens and can reason in as few as 3 steps. Naively forcing the model to think with latent states easily produces meaningless semantics and makes training unstable. To effectively regulate the latent reasoning process, we introduce a lightweight external decoder that provides step-level supervision for each latent reasoning step in two complementary directions: a forward mode that decodes latent thoughts into the textual reasoning of the next step, and a backward mode that aligns decoder hidden states with the model's latent thoughts given preceding textual context. We further incorporate internal supervision that encourages coherent step-by-step latent transitions. The decoder and internal supervision are removed during inference to maintain high efficiency of latent reasoning. Extensive experiments on eight benchmarks demonstrate that CoLT not only outperforms existing latent reasoning methods such as CODI and SIM-CoT, but also surpasses latent visual reasoning approaches that rely on auxiliary images with costly annotation requirements. Compared to text CoT methods, CoLT can notably reduce inference time by 10.1× and text decoding time by 22.6×. Code is released at https://github.com/hulianyuyy/CoLT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoLT, a framework that enables multi-modal LLMs to perform chain-of-thought reasoning via short chains (as few as 3 steps) of latent representations instead of explicit text tokens. A lightweight external decoder supplies step-level supervision in forward mode (decoding latents to next-step text) and backward mode (aligning decoder hidden states to latents given prior text), supplemented by an internal coherence loss on consecutive latent states; both decoder and losses are removed at inference. Experiments on eight benchmarks claim outperformance over CODI, SIM-CoT, and auxiliary-image latent methods, plus 10.1× inference-time and 22.6× text-decoding-time reductions versus text CoT baselines.
Significance. If the central claim holds—that the learned latent chains encode usable reasoning content that survives removal of the external decoder—this would represent a meaningful advance in efficient multi-modal reasoning by sidestepping the token overhead and expressivity limits of text CoT. The public code release is a positive factor for reproducibility.
major comments (3)
- [Method section] Method section (description of forward/backward decoder and coherence loss): no ablation or controlled experiment is described that isolates the latent states from decoder gradients (e.g., by freezing or removing the decoder during training and measuring downstream performance). This is load-bearing for the claim that gains derive from genuine latent reasoning rather than decoder-dependent artifacts.
- [Experiments] Experiments / results: the manuscript asserts stable training and performance gains but supplies no quantitative metrics on training stability (loss curves, variance across seeds), decoder-mode ablations, or error analysis of the 3-step latent chains. Without these, it is impossible to verify that the external supervision reliably prevents meaningless latent semantics.
- [Results] Results section: no probing, nearest-neighbor analysis, or intervention study is reported that directly tests whether the latent representations remain semantically functional once the decoder is stripped at inference. This evidence gap directly affects the strongest claim that CoLT outperforms prior latent methods because of latent-space reasoning.
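The intervention study requested in the third major comment could be sketched as follows: perturb one latent step with Gaussian noise and measure how answer accuracy changes. Everything here is an assumption for illustration. `answer_fn` stands in for whatever maps a latent chain to a final answer, and the toy version below reads only the last latent, whereas in a real model a perturbed step would also change the steps that follow it.

```python
import random

def perturb(latent, scale, rng):
    """Add zero-mean Gaussian noise with standard deviation `scale` to one latent."""
    return [x + rng.gauss(0.0, scale) for x in latent]

def accuracy_under_intervention(examples, answer_fn, step, scale, seed=0):
    """Accuracy when only the latent at index `step` is perturbed."""
    rng = random.Random(seed)
    correct = 0
    for latents, label in examples:
        chain = [perturb(z, scale, rng) if i == step else z
                 for i, z in enumerate(latents)]
        correct += int(answer_fn(chain) == label)
    return correct / len(examples)

def toy_answer_fn(chain):
    """Hypothetical answer head: sign of the last latent's first entry."""
    return int(chain[-1][0] > 0)

# Two toy three-step chains with binary labels (illustrative values only).
examples = [
    ([[0.2, 0.1], [0.5, -0.3], [1.0, 0.0]], 1),
    ([[0.1, 0.4], [-0.2, 0.6], [-1.0, 0.0]], 0),
]
clean = accuracy_under_intervention(examples, toy_answer_fn, step=2, scale=0.0)
noisy = accuracy_under_intervention(examples, toy_answer_fn, step=2, scale=5.0)
print(clean, noisy)
```

If perturbing a given step leaves accuracy unchanged at every noise scale, that step is unlikely to carry information the answer depends on.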
minor comments (2)
- [Abstract] Abstract: the reported speed-up factors (10.1×, 22.6×) would be more informative if accompanied by the exact baseline configurations and hardware used.
- [Method section] Notation: the distinction between “latent thought representations” and the model’s internal hidden states could be clarified with an explicit diagram or equation in the method section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting areas where additional evidence would strengthen the claims regarding latent-space reasoning. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Method section] Method section (description of forward/backward decoder and coherence loss): no ablation or controlled experiment is described that isolates the latent states from decoder gradients (e.g., by freezing or removing the decoder during training and measuring downstream performance). This is load-bearing for the claim that gains derive from genuine latent reasoning rather than decoder-dependent artifacts.
  Authors: We agree that an ablation isolating the contribution of decoder gradients is necessary to support the claim of genuine latent reasoning. In the revised manuscript we will add a controlled experiment in which the external decoder is either removed entirely or its parameters are frozen after an initial warm-up phase, with downstream benchmark performance reported for comparison against the full CoLT training procedure. This will quantify the extent to which the learned latents depend on ongoing decoder supervision versus functioning independently.
  Revision: yes
- Referee: [Experiments] Experiments / results: the manuscript asserts stable training and performance gains but supplies no quantitative metrics on training stability (loss curves, variance across seeds), decoder-mode ablations, or error analysis of the 3-step latent chains. Without these, it is impossible to verify that the external supervision reliably prevents meaningless latent semantics.
  Authors: We acknowledge the absence of these quantitative diagnostics. The revised manuscript will include (i) training loss curves for both forward and backward modes, (ii) performance variance across at least three random seeds, (iii) an ablation comparing forward-only, backward-only, and combined decoder supervision, and (iv) a qualitative error analysis of the 3-step latent chains on representative failure cases from the benchmarks.
  Revision: yes
- Referee: [Results] Results section: no probing, nearest-neighbor analysis, or intervention study is reported that directly tests whether the latent representations remain semantically functional once the decoder is stripped at inference. This evidence gap directly affects the strongest claim that CoLT outperforms prior latent methods because of latent-space reasoning.
  Authors: The primary support for post-decoder semantic functionality is the consistent outperformance over CODI, SIM-CoT, and auxiliary-image baselines on eight benchmarks after decoder removal. Nevertheless, we agree that direct tests would be more conclusive. In revision we will add a nearest-neighbor analysis of the learned latent states against text embeddings of reasoning steps, together with a simple intervention study that perturbs individual latent vectors and measures the effect on final answer accuracy.
  Revision: yes
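The nearest-neighbor analysis promised in the last response could take roughly the following shape. This is a sketch under stated assumptions: it presumes latent states and text embeddings of reasoning steps live in (or have been projected into) a space of the same dimension, it uses cosine similarity as the metric, and `nearest_text_step` is a hypothetical helper rather than anything from the released code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_text_step(latent, text_embeddings):
    """Return (index, similarity) of the text embedding closest to `latent`."""
    scored = [(cosine_similarity(latent, e), i) for i, e in enumerate(text_embeddings)]
    best_sim, best_idx = max(scored)
    return best_idx, best_sim

# Toy embeddings of three textual reasoning steps (illustrative values only).
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
latent = [0.9, 0.1]
idx, sim = nearest_text_step(latent, text_embeddings)
print(idx, round(sim, 3))
```

If the latent at step k consistently retrieves the text of reasoning step k (or k+1, matching the forward-decoding target), that would count as evidence that the latents stay semantically meaningful after the decoder is removed.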
Circularity Check
No significant circularity; method is an independent training procedure evaluated on external benchmarks.
Full rationale
The paper describes a training procedure that uses an external lightweight decoder (forward/backward modes) plus internal coherence losses to supervise latent-state transitions, then removes the decoder at inference. No equations, fitted parameters, or self-citations are presented that would make the reported gains or latent-reasoning claims equivalent to the inputs by construction. Performance is measured on eight external benchmarks against baselines (CODI, SIM-CoT, text CoT), satisfying the criteria for a self-contained, non-circular derivation. The skeptic's concern about decoder-dependent artifacts is a question of empirical validity, not of circularity.