pith. machine review for the scientific record.

arxiv: 2605.02735 · v1 · submitted 2026-05-04 · 💻 cs.LG

Recognition: unknown

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Jiawei Du, Joey Tianyi Zhou, Moyun Liu, Qiqi Tao, Xin Zhang

Pith reviewed 2026-05-09 15:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords visual latents · latent reasoning · multimodal large language models · inference-time optimization · silenced latents · autoregressive suppression · chain-of-thought alternative · visual question answering

The pith

Visual latents in multimodal models carry richer reasoning than they contribute to predictions, because training suppresses their role in favor of direct visual shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that visual latents in MLLMs become semantically enriched during training yet contribute little to final predictions, because the autoregressive objective favors shortcut reliance on direct visual input and pushes the latents into low-information transition states. This leaves reasoning capacity untapped during standard inference. To recover it, the authors optimize the latents directly at inference time with the backbone frozen: first through query-guided contrastive alignment to improve semantic quality, then via a confidence-progression reward that makes token distributions along the latent span progressively more concentrated. A reader would care because this offers a way to access higher-dimensional visual reasoning without retraining or adding explicit reasoning tokens.
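
As a concrete reading of Stage I, a minimal sketch of query-guided contrastive latent–visual alignment in PyTorch. The InfoNCE-style loss form, tensor shapes, temperature, and learning rate are illustrative assumptions rather than the authors' implementation; the step and patch counts echo the settings quoted in Figure 5 below (Nsft = 5, pos_num = 2, neg_num = 4).

    import torch
    import torch.nn.functional as F

    def stage1_contrastive_loss(latents, pos_patches, neg_patches, tau=0.07):
        """InfoNCE-style pull of each latent toward query-selected positive
        patches and away from negative patches. latents: (L, d);
        pos_patches: (P, d); neg_patches: (N, d)."""
        latents = F.normalize(latents, dim=-1)
        pos = F.normalize(pos_patches, dim=-1)
        neg = F.normalize(neg_patches, dim=-1)
        logits = torch.cat([latents @ pos.T, latents @ neg.T], dim=-1) / tau
        log_prob = F.log_softmax(logits, dim=-1)       # (L, P + N)
        return -log_prob[:, :pos.shape[0]].mean()      # positives as targets

    # Only the latent tokens are optimized; the backbone stays frozen.
    latents = torch.randn(8, 768, requires_grad=True)
    optimizer = torch.optim.Adam([latents], lr=1e-2)
    for _ in range(5):                                 # Nsft = 5 per Figure 5
        loss = stage1_contrastive_loss(
            latents,
            torch.randn(2, 768),                       # pos_num = 2, stand-in features
            torch.randn(4, 768))                       # neg_num = 4, stand-in features
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()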

Core claim

Although visual latents grow semantically rich during training, the autoregressive objective systematically suppresses their contribution to answer prediction by driving them toward transition-like states rather than informative content. Disentangling the objectives at inference time through two stages—query-guided contrastive latent-visual alignment followed by confidence-progression reward optimization—routes predictions through the latent reasoning path instead of bypassing it, unleashing suppressed capacity without any parameter updates.
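
The Stage II objective can be sketched similarly. The paper frames it as a confidence-progression reward, so the differentiable hinge-style surrogate below is an assumption; it penalizes any latent position whose readout distribution is less concentrated than its predecessor and pushes the final position toward a sharp readout.

    import torch

    def confidence_progression_loss(latent_logits):
        """latent_logits: (L, V) next-token logits read out by the frozen
        backbone at each of the L latent positions. Entropy should be
        non-increasing along the span."""
        probs = torch.softmax(latent_logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)   # (L,)
        monotonicity = torch.relu(entropy[1:] - entropy[:-1]).sum()
        return monotonicity + entropy[-1]

    # Stand-in readout logits; in practice these depend on the latents
    # through the frozen backbone.
    latent_logits = torch.randn(8, 32000, requires_grad=True)
    confidence_progression_loss(latent_logits).backward()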

What carries the argument

The Silenced Visual Latents phenomenon, in which training dynamics suppress the latents' contribution to answer prediction, addressed by a two-stage inference-time optimization of the latent tokens themselves.
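
Read operationally, that is a single optimization loop over the latent tokens with everything else frozen. A minimal sketch of the two-stage loop, assuming loss callables like those above; the function names, stage split, and learning rate are illustrative, with step counts taken from Figure 5 (Nsft = 5, Nrl = 15).

    import torch

    def optimize_latents(latents, stage1_loss_fn, stage2_loss_fn,
                         n_sft=5, n_rl=15, lr=1e-2):
        """Two-stage inference-time optimization with a frozen backbone:
        Stage I warms up the latents, Stage II (initialized from the
        Stage-I latents) refines them. Returns the final latents H*."""
        latents = latents.clone().requires_grad_(True)
        opt = torch.optim.Adam([latents], lr=lr)
        for step in range(n_sft + n_rl):
            loss_fn = stage1_loss_fn if step < n_sft else stage2_loss_fn
            opt.zero_grad()
            loss_fn(latents).backward()
            opt.step()
        return latents.detach()                 # H*, used for answer generation

    # Stand-in losses; real ones would query the frozen backbone.
    h_star = optimize_latents(torch.randn(8, 768),
                              lambda h: h.pow(2).mean(),
                              lambda h: (1 - h).pow(2).mean())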

If this is right

  • Latent reasoning becomes a compact high-dimensional alternative to textual chain-of-thought for integrating visual evidence.
  • Performance improves across eight benchmarks and four model backbones with no weight updates required.
  • Predictions can be systematically routed through enriched visual latents instead of direct input pathways.
  • Conflicting objectives of semantic enrichment and prediction contribution can be separated by freezing the backbone and tuning only latents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar suppression of internal states may occur in non-visual reasoning paths within the same models or in pure language models.
  • The approach could be tested on other multimodal tasks such as visual captioning to check if latent routing yields more coherent outputs.
  • Combining inference-time latent tuning with lightweight adapters might amplify gains while keeping most parameters fixed.
  • If the effect is general, current inference practices in multimodal systems may routinely underuse available internal representations.

Load-bearing premise

The autoregressive training objective pushes latent tokens into uninformative states, and the two inference stages can reliably force the model to route its prediction through those latents rather than falling back to direct visual input.

What would settle it

Applying the two-stage inference optimization to a standard visual reasoning benchmark and observing neither accuracy gains nor measurable increases in progressive token confidence along the latent span would show that the latents cannot be unsilenced this way.
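
A hedged sketch of that measurement: compute a per-position confidence profile along the latent span before and after optimization and check whether it rises and becomes more monotone. The max-probability proxy and the shapes are illustrative assumptions.

    import torch

    def confidence_profile(latent_logits):
        """Max-probability confidence of the readout distribution at each
        latent position. latent_logits: (L, V)."""
        return torch.softmax(latent_logits, dim=-1).max(dim=-1).values

    before = confidence_profile(torch.randn(8, 32000))  # pre-optimization readouts
    after = confidence_profile(torch.randn(8, 32000))   # post-optimization readouts
    # The unsilencing claim predicts `after` sits higher and rises more
    # monotonically along the span; a flat profile alongside flat benchmark
    # accuracy would count against it.
    print((after.diff() > 0).float().mean() - (before.diff() > 0).float().mean())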

Figures

Figures reproduced from arXiv: 2605.02735 by Jiawei Du, Joey Tianyi Zhou, Moyun Liu, Qiqi Tao, Xin Zhang.

Figure 1: The joint loss landscape of Latent Visual Reasoning. Under joint optimization, the initial latents are simultaneously pulled toward two conflicting attractors: the autoregressive prediction objective (left) and the visual reasoning objective (right). The shortcut favored by the autoregressive prediction objective dissociates latent quality from latent effectiveness and bypasses meaningful latent reasonin…
Figure 2: Enhanced visual latents improve reasoning performance, but this benefit is suppressed under joint optimization. Left: visual latents are progressively enhanced during training, as evidenced by the deepening red color indicating higher similarity between latents and pre-curated visual clues (a), while the alignment loss decreases steadily (b). Right: although visual latents continue to improve, the jointly …
Figure 3: Joint optimization silences visual latents. (a) During training, attention gradually drifts toward the input visual tokens rather than the visual latents, indicating that visual latents progressively lose their voice in answer prediction. (b) Prediction logits of the first latent token at different training stages. The latent token is increasingly pushed toward the <latent_end> token, as indicated by the ove…
Figure 4: Overview of the proposed framework for unsilencing visual latents. Stage I performs query-guided contrastive latent–visual alignment with positive and negative patch sets to enhance latent semantics. Stage II initializes from the Stage-I latents and applies a confidence-progression reward that encourages progressively more concentrated output token distributions, yielding the final latents H∗ for answer ge…
Figure 5: Ablation studies on Hull-Bench. (a, b) Effects of the Stage-I and Stage-II optimization steps, Nsft and Nrl, respectively. Gray bars denote the baseline without the corresponding component, and darker bars denote the final setting; we use Nsft = 5 and Nrl = 15 for the accuracy–efficiency trade-off. (c, d) Effects of the negative and positive patch numbers, with pos_num = 2 and neg_num = 4 fixed, respectively. Th…
Figure 6: Attention visualization on an MMVP sample. Optimized latents receive more focused attention than visual inputs, reducing shortcuts compared with …
Figure 7: Efficiency ratio of different methods on MMVP. Higher values indicate a better performance-efficiency balance. Calculation details are provided in Appendix B.3.
Figure 8: GPT-4o judging prompt used for answer evaluation. The prompt asks the evaluator to …
Figure 9: More attention visualization.
read the original abstract

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent–visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that visual latents in MLLMs are systematically suppressed ('Silenced Visual Latents') by the autoregressive objective during training, which favors direct visual shortcuts and drives latents to transition-like states. It proposes a parameter-free, inference-time two-stage optimization—Stage I (query-guided contrastive latent-visual alignment) and Stage II (confidence-progression reward)—to improve latent semantic quality and route predictions through the latent span. Experiments across eight benchmarks and four model backbones show performance gains without any parameter updates, arguing that this unleashes suppressed reasoning capacity in visual latents.

Significance. If the results and mechanistic interpretation hold, the work would be significant for multimodal learning by demonstrating a practical inference-time method to enhance latent reasoning without retraining. It identifies a potential optimization pathology in existing latent visual reasoning approaches and reports gains on multiple benchmarks and backbones, offering an efficient alternative to textual chain-of-thought or full fine-tuning.

major comments (2)
  1. [§5] The central claim that gains arise specifically from routing predictions through the latent span (rather than improved visual-query alignment alone) is load-bearing but unsupported by isolating evidence. No attention/gradient attribution to latent tokens, intervention experiments, or comparison against equivalent optimization applied only to visual features is described, leaving alternative explanations for the benchmark lifts unaddressed.
  2. [Abstract and §5] The abstract and results report gains on eight benchmarks but provide no quantitative values, error bars, ablation controls on Stage I vs. Stage II, or data exclusion rules. This prevents assessment of effect sizes and robustness, which is necessary to substantiate the claim that the stages reverse the silencing pathology.
minor comments (2)
  1. [§1] The introduction of 'Silenced Visual Latents' as a new term would benefit from a formal mathematical characterization or precise definition early in the paper to distinguish it from related concepts like latent collapse.
  2. Figure and table captions should explicitly state the number of runs, random seeds, and statistical tests used to support the reported improvements.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the evidence and presentation of our results. We address each major comment below and commit to revisions that improve the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [§5] The central claim that gains arise specifically from routing predictions through the latent span (rather than improved visual-query alignment alone) is load-bearing but unsupported by isolating evidence. No attention/gradient attribution to latent tokens, intervention experiments, or comparison against equivalent optimization applied only to visual features is described, leaving alternative explanations for the benchmark lifts unaddressed.

    Authors: We agree that isolating the contribution of latent-span routing is essential to support the silencing pathology interpretation. The two-stage design separates query-guided alignment (Stage I) from the confidence-progression reward (Stage II), and the manuscript reports incremental gains from adding Stage II. However, we did not include intervention experiments, gradient attributions on latent tokens, or direct comparisons to equivalent optimization applied only to visual features. In the revision we will add these ablations and analyses to provide the requested isolating evidence; a sketch of one such routing probe follows these responses. revision: yes

  2. Referee: [Abstract and §5] The abstract and results report gains on eight benchmarks but provide no quantitative values, error bars, ablation controls on Stage I vs. Stage II, or data exclusion rules. This prevents assessment of effect sizes and robustness, which is necessary to substantiate the claim that the stages reverse the silencing pathology.

    Authors: We acknowledge that the abstract summarizes gains at a high level without specific numbers or error bars, and that clearer presentation of ablations and data rules would aid assessment. The full experiments section contains tables with performance metrics across the eight benchmarks and four backbones, along with Stage I vs. Stage II ablations; however, we will revise the abstract to include representative quantitative results with error bars, ensure all ablations are explicitly labeled with data exclusion criteria, and add any missing robustness details. revision: partial
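
On the first point, the requested isolating evidence could start with an attention-routing probe in the spirit of Figure 3(a): measure how much of the answer token's attention mass lands on the latent span versus the raw visual tokens, before and after optimization. The index layout and shapes below are assumptions for illustration.

    import torch

    def attention_routing(attn, visual_idx, latent_idx):
        """attn: (heads, seq) attention weights from the first answer token.
        Returns (mass on visual tokens, mass on latent tokens), averaged
        over heads."""
        mass = attn.mean(dim=0)
        return mass[visual_idx].sum().item(), mass[latent_idx].sum().item()

    attn = torch.rand(32, 600).softmax(dim=-1)          # stand-in attention map
    visual_mass, latent_mass = attention_routing(
        attn,
        visual_idx=torch.arange(0, 576),                # e.g. a 24x24 patch grid
        latent_idx=torch.arange(576, 584))              # e.g. 8 latent tokens
    # If Stage II truly reroutes prediction, latent_mass should rise relative
    # to visual_mass, not just end-task accuracy.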

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper's central contribution is an empirical inference-time procedure (Stage I query-guided contrastive alignment followed by Stage II confidence-progression reward) applied to frozen backbones. These stages are defined directly from the stated objectives of alignment and progressive concentration of token distributions, without any equations that reduce reported benchmark gains to quantities fitted on the same test data or to self-referential definitions of the 'silenced' phenomenon. The identification of the autoregressive pathology is presented as an observational claim rather than a derived theorem, and no load-bearing self-citations or uniqueness results are invoked in the provided text to force the method. Experiments on external benchmarks therefore stand as independent measurements rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The claim rests on the empirical observation of latent suppression and the effectiveness of the two inference stages; no free parameters, mathematical axioms, or new physical entities are introduced.

invented entities (1)
  • Silenced Visual Latents · no independent evidence
    purpose: Names the observed suppression of latent reasoning content during training
    New descriptive term coined in the abstract to label the pathology; no independent evidence provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5525 in / 1108 out tokens · 27841 ms · 2026-05-09T15:44:19.547258+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report, 2025. URL https://arxiv.org/abs/2502.13923

  2. [2]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  3. [3]

    Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231, 2025

    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231, 2025

  4. [4]

    Sft or rl? an early investigation into training r1-like reasoning large vision-language models, 2025

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models, 2025

  5. [5]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024

  6. [6]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  8. [8]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024

  9. [9]

    Refocus: Visual editing as a chain of thought for structured image understanding

    Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Richard Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding. In International Conference on Machine Learning, pages 17783–17805. PMLR, 2025

  10. [10]

    Omni-MATH: A universal olympiad level mathematic benchmark for large language models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In The Thirteenth ...

  11. [11]

    Interleaved-modal chain-of-thought, 2025

    Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought, 2025

  12. [12]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023

  13. [13]

    Training large language models to reason in a continuous latent space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. In Second Conference on Language Modeling

  14. [14]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024

  15. [15]

    Latent visual reasoning

    Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, and Zicheng Liu. Latent visual reasoning. In International Conference on Learning Representations, 2026

  16. [16]

    Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

    Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, and Xin Eric Wang. Reasoning within the mind: Dynamic multimodal interleaving in latent space. arXiv preprint arXiv:2512.12623, 2025

  17. [17]

    Deliberation in latent space via differentiable cache augmentation

    Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, and Arthur Szlam. Deliberation in latent space via differentiable cache augmentation. In Forty-second International Conference on Machine Learning

  18. [18]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024

  19. [19]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  20. [20]

    A survey on latent reasoning. arXiv, 2025

    M-A-P. A survey on latent reasoning. arXiv, 2025

  21. [21]

    Cocova: Chain of continuous vision-language thought for latent space reasoning. arXiv preprint arXiv:2511.02360, 2025

    Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, and Han Yan. Multimodal reasoning via latent refocusing. arXiv preprint arXiv:2511.02360, 2025

  22. [22]

    Compositional chain-of-thought prompting for large multimodal models

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14420–14431, June 2024

  23. [23]

    Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418, 2025

    Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, and XuDong Wang. Chain-of-visual-thought: Teaching vlms to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418, 2025

  24. [24]

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

  25. [25]

    Codi: Compressing chain-of-thought into continuous space via self-distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 677–693, 2025

  26. [26]

    Swireasoning: Switch-thinking in latent and explicit for pareto-superior reasoning LLMs

    Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, and Wen Xiao. Swireasoning: Switch-thinking in latent and explicit for pareto-superior reasoning LLMs. In The Fourteenth International Conference on Learning Representations, 2026

  27. [27]

    Think silently, think fast: Dynamic latent compression of llm reasoning chains. Advances in Neural Information Processing Systems, 2025

    Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, and Ruihua Song. Think silently, think fast: Dynamic latent compression of llm reasoning chains. Advances in Neural Information Processing Systems, 2025

  28. [28]

    Visual position prompt for mllm based visual grounding. IEEE Transactions on Multimedia, 2026

    Wei Tang, Yanpeng Sun, Qinying Gu, and Zechao Li. Visual position prompt for mllm based visual grounding. IEEE Transactions on Multimedia, 2026

  29. [29]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024

  30. [30]

    MLLM can see? dynamic correction decoding for hallucination mitigation

    Chenxi Wang, Xiang Chen, Ningyu Zhang, Bozhong Tian, Haoming Xu, Shumin Deng, and Huajun Chen. MLLM can see? dynamic correction decoding for hallucination mitigation. In The Thirteenth International Conference on Learning Representations, 2025

  31. [31]

    Monet: Reasoning in latent visual space beyond images and language. arXiv preprint arXiv:2511.21395, 2025

    Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, and Yisen Wang. Monet: Reasoning in latent visual space beyond images and language. arXiv preprint arXiv:2511.21395, 2025

  32. [32]

    Image tokens matter: Mitigating hallucination in discrete tokenizer-based large vision-language models via latent editing. arXiv preprint arXiv:2505.21547, 2025

    Weixing Wang, Zifeng Ding, Jindong Gu, Rui Cao, Christoph Meinel, Gerard de Melo, and Haojin Yang. Image tokens matter: Mitigating hallucination in discrete tokenizer-based large vision-language models via latent editing. arXiv preprint arXiv:2505.21547, 2025

  33. [33]

    Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025

    Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025

  34. [34]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  35. [35]

    Deepscientist: Advancing frontier-pushing scientific findings progressively

    Yixuan Weng, Minjun Zhu, Qiujie Xie, QiYao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively. In The Fourteenth International Conference on Learning Representations, 2026

  36. [36]

    Mind’s eye of llms: visualization-of-thought elicits spatial reasoning in large language models

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: visualization-of-thought elicits spatial reasoning in large language models. Advances in Neural Information Processing Systems, 37:90277–90317, 2024

  37. [37]

    Mini-omni-reasoner: Token-level thinking-in-speaking in large speech models. arXiv preprint, 2025

    Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, and Shuicheng Yan. Mini-omni-reasoner: Token-level thinking-in-speaking in large speech models. arXiv preprint, 2025

  38. [38]

    Thinking in uncertainty: Mitigating hallucinations in mlrms with latent entropy-aware decoding. arXiv preprint arXiv:2603.13366, 2026

    Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, and Zongyuan Ge. Thinking in uncertainty: Mitigating hallucinations in mlrms with latent entropy-aware decoding. arXiv preprint arXiv:2603.13366, 2026

  39. [39]

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization, 2025

  40. [40]

    Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218, 2025

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens. arXiv preprint arXiv:2506.17218, 2025

  41. [41]

    Diffusion of thought: Chain-of-thought reasoning in diffusion language models. Advances in Neural Information Processing Systems, 37:105345–105374, 2024

    Jiacheng Ye, Shansan Gong, Liheng Chen, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, et al. Diffusion of thought: Chain-of-thought reasoning in diffusion language models. Advances in Neural Information Processing Systems, 37:105345–105374, 2024

  42. [42]

    A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024

  43. [43]

    Mm-cot: a benchmark for probing visual chain-of-thought reasoning in multimodal models. arXiv preprint arXiv:2512.08228, 2025

    Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, et al. Mm-cot: a benchmark for probing visual chain-of-thought reasoning in multimodal models. arXiv preprint arXiv:2512.08228, 2025

  44. [44]

    Multimodal chain-of-thought reasoning in language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. 2024

  45. [45]

    Promptcot: Synthesizing olympiad-level problems for mathematical reasoning in large language models

    Xueliang Zhao, Wei Wu, Jian Guan, and Lingpeng Kong. Promptcot: Synthesizing olympiad-level problems for mathematical reasoning in large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18167–18188, 2025

  46. [46]

    Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024

  47. [47]

    Intern-s1-pro: Scientific multimodal foundation model at trillion scale, 2026

    Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, et al. Intern-s1-pro: Scientific multimodal foundation model at trillion scale. arXiv preprint arXiv:2603.25040, 2026