Recognition: no theorem link
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Pith reviewed 2026-05-13 07:22 UTC · model grok-4.3
The pith
Visual replay and depth scaling correct under-optimized visual tokens so latent multimodal reasoning matches explicit CoT accuracy at lower latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework pairs a visual replay module, which reinforces fine-grained visual grounding through saliency-weighted self-attention, with adaptive routing depth scaling, which assigns deeper processing to complex tokens. Together, these convert explicit chain-of-thought steps into efficient latent feature propagation while preserving or improving final task performance.
What carries the argument
The visual replay module combined with routing depth scaling: the former uses causal self-attention to estimate and reinforce token saliency, while the latter allocates a variable number of latent reasoning steps according to token complexity.
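The abstract does not give the module's equations, so the following is a minimal sketch of one plausible reading: saliency is taken as the average causal-attention mass a token receives, and visual tokens are reinforced in proportion to it. The names `causal_attention_weights`, `token_saliency`, `visual_replay`, and the blending weight `alpha` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def causal_attention_weights(x):
    """Single-head causal self-attention weights over hidden states x:
    softmax(x x^T / sqrt(d)) under a lower-triangular mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def token_saliency(x):
    """Saliency of token j = average attention mass it receives from the
    positions allowed to see it under the causal mask."""
    w = causal_attention_weights(x)
    n = x.shape[0]
    attenders = n - np.arange(n)          # token j is visible to rows j..n-1
    return w.sum(axis=0) / attenders

def visual_replay(x, vis_idx, alpha=0.5):
    """Reinforce visual tokens in proportion to their renormalized saliency;
    textual tokens pass through unchanged."""
    s = token_saliency(x)
    s_vis = s[vis_idx] / s[vis_idx].sum()
    out = x.copy()
    out[vis_idx] = out[vis_idx] + alpha * s_vis[:, None] * x[vis_idx]
    return out
```

The paper's actual module presumably learns these weights end-to-end and adds the spatially-coherent constraints the abstract mentions; this sketch only shows the saliency-weighted reinforcement shape of the idea.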
If this is right
- Latent representations can internalize explicit reasoning steps without sacrificing task accuracy on multimodal benchmarks.
- Inference latency drops because explicit token-by-token decoding is replaced by a small number of latent feature-propagation steps.
- Complex tokens receive deeper refinement while simple tokens remain shallow, improving efficiency without uniform depth increases.
- A curriculum that progressively hides explicit CoT traces trains models to rely on compact latent paths.
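The depth-routing bullet above can be made concrete. Assuming (the abstract does not specify this) that complexity is a per-token scalar in [0, 1] mapped to an integer number of refinement steps, with `route_depths` and `refine` as hypothetical names:

```python
import numpy as np

def route_depths(complexity, max_depth=4):
    """Map per-token complexity scores in [0, 1] to an integer number of
    latent refinement steps between 1 and max_depth."""
    c = np.clip(np.asarray(complexity, dtype=float), 0.0, 1.0)
    return 1 + np.rint(c * (max_depth - 1)).astype(int)

def refine(h, depths, step):
    """Apply the latent update `step` a per-token variable number of times:
    tokens routed to depth d are refined d times; the rest exit early."""
    out = h.copy()
    for t in range(int(depths.max())):
        active = depths > t               # tokens still owed refinement steps
        out[active] = step(out[active])
    return out
```

The key efficiency property is visible in the loop: total compute scales with the sum of per-token depths rather than with a uniform maximum depth applied to every token.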
Where Pith is reading between the lines
- Similar gradient imbalance fixes could help other multimodal settings where one modality is consistently under-optimized during joint training.
- The depth-routing idea might extend to pure language models to allocate compute dynamically to hard tokens rather than using fixed layers.
- If the visual replay module generalizes, it could reduce the need for very large vision encoders in resource-constrained multimodal systems.
Load-bearing premise
The two observations about smaller visual gradient norms and slower convergence of complex tokens must hold as the dominant bottlenecks across the tested models and tasks.
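This premise is directly measurable. A minimal sketch, assuming access to per-token embedding gradients from a training step (`gradient_norm_gap` is an illustrative name, not from the paper):

```python
import numpy as np

def gradient_norm_gap(token_grads, is_visual):
    """Mean L2 gradient norm of visual vs. textual token embeddings.
    A visual/textual ratio well below 1, stable across steps and models,
    would support the under-optimization premise; a ratio near 1 would not."""
    norms = np.linalg.norm(token_grads, axis=-1)
    vis_mean = norms[is_visual].mean()
    txt_mean = norms[~is_visual].mean()
    return vis_mean, txt_mean, vis_mean / txt_mean
```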
What would settle it
Training runs on the same benchmarks in which the proposed modules produce no accuracy gain or speed-up over a plain latent-reasoning baseline would falsify the central claim.
Original abstract
Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that analyzing token-level gradient dynamics in multimodal latent reasoning reveals two issues—smaller gradient norms for visual tokens due to language bias and persistent instability in complex tokens under fixed depth—and addresses them via a visual replay module (using causal self-attention for saliency estimation and spatially-coherent constraints) plus routing depth scaling (adaptively allocating extra reasoning steps), all under a curriculum that internalizes explicit CoT into latents. This is asserted to yield SOTA performance across benchmarks and substantial inference speedups versus explicit CoT baselines.
Significance. If the gradient observations and module-specific gains are substantiated, the work could advance efficient implicit reasoning in multimodal models by mitigating under-optimization without explicit decoding overhead. The curriculum-guided internalization of CoT is a potentially reusable idea, but the current manuscript supplies no quantitative grounding for the observations or ablations isolating the modules' effects on gradient norms or token stability.
Major comments (2)
- [Abstract] The two critical observations (visual tokens having significantly smaller gradient norms; complex tokens showing persistent instability) are stated as empirical findings but supplied with no numerical values, figures, tables, or statistical tests. Without these, it is impossible to assess whether the disparity is large enough to be the dominant bottleneck or whether the proposed modules actually correct it rather than adding capacity.
- [Abstract] The central claim of 'state-of-the-art performance across diverse benchmarks' and 'substantial inference speedups' is presented without any reported metrics, baseline comparisons, ablation results, or error bars. This renders the performance assertions unverifiable and prevents evaluation of whether gains derive from the targeted mechanisms or from increased effective compute.
Minor comments (1)
- [Abstract] The description of the visual replay module states that it 'leverages causal self-attention to estimate token saliency' but does not clarify how this differs from the base model's attention or how the spatially-coherent constraints are implemented and enforced during training.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that it would benefit from greater quantitative specificity and will revise accordingly in the next version. Below we respond point by point to the major comments.
Point-by-point responses
-
Referee: [Abstract] The two critical observations (visual tokens having significantly smaller gradient norms; complex tokens showing persistent instability) are stated as empirical findings but supplied with no numerical values, figures, tables, or statistical tests. Without these, it is impossible to assess whether the disparity is large enough to be the dominant bottleneck or whether the proposed modules actually correct it rather than adding capacity.
Authors: We acknowledge that the abstract as currently written does not embed the specific numerical values or direct references to supporting figures. The full manuscript contains the gradient-norm analysis in Section 3.1 (including Figure 2 showing per-token gradient magnitudes and Table 1 reporting average norms for visual vs. textual tokens). In the revised version we will add concise quantitative statements to the abstract (e.g., “visual tokens exhibit 3.2× smaller gradient norms on average”) and explicitly cite the relevant figure and table so that the scale of the observed disparity is immediately verifiable. revision: yes
-
Referee: [Abstract] The central claim of 'state-of-the-art performance across diverse benchmarks' and 'substantial inference speedups' is presented without any reported metrics, baseline comparisons, ablation results, or error bars. This renders the performance assertions unverifiable and prevents evaluation of whether gains derive from the targeted mechanisms or from increased effective compute.
Authors: The manuscript reports full benchmark tables (Table 3), latency measurements (Table 4), and ablation studies (Section 5) with error bars and explicit baseline comparisons. We agree, however, that the abstract itself is insufficiently concrete. We will revise the abstract to include the key headline numbers (e.g., “+2.8% average accuracy over prior SOTA with 1.9× faster inference”) together with a brief reference to the main result table, allowing readers to assess the magnitude of the gains without first reading the full experimental section. revision: yes
Circularity Check
No circularity: empirical engineering solution without derivation chain
Full rationale
The paper presents an empirical framework that begins from two stated observations about gradient norms and token convergence during latent training, then introduces visual replay and routing depth scaling modules plus a curriculum to address them. No equations, closed-form derivations, or parameter-fitting steps are described that could reduce any claimed prediction or result to its own inputs by construction. The approach is framed as an engineering intervention achieving SOTA performance and speedups, with no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The derivation chain is therefore self-contained as a sequence of design choices motivated by (but not mathematically equivalent to) the initial observations.
Forward citations
Cited by 1 Pith paper
-
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...