
arxiv: 2512.10941 · v2 · submitted 2025-12-11 · 💻 cs.CV · cs.AI

Mull-Tokens: Modality-Agnostic Latent Thinking

Pith reviewed 2026-05-16 22:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Mull-Tokens · latent reasoning · modality-agnostic tokens · spatial reasoning · multimodal reasoning · interleaved traces · puzzle solving · perspective taking

The pith

Mull-Tokens are modality-agnostic latent tokens that let models reason across text and image space using only final-answer supervision after initial interleaved training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Mull-Tokens as a simpler way for multimodal models to reason free-form across text and visual information without calling external tools or generating images on the fly. The tokens are first trained on examples that interleave text and image steps, then fine-tuned using only the final correct answers, with no further modality guidance. On four spatial reasoning benchmarks that include puzzle solving and perspective taking, the approach beats text-only baselines and explicit interleaved image-text methods by an average of 3 percent, with a peak gain of 16 percent on a reasoning-heavy puzzle-solving split. A sympathetic reader would care because current multimodal reasoning methods are brittle and expensive: if latent tokens can carry useful cross-modal information in a compact form, models could scale reasoning without handcrafted traces or costly generation steps. The core premise is that pre-training on mixed traces creates reusable intermediate representations that survive answer-only fine-tuning.

Core claim

Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split.
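
The two-stage recipe described above can be pictured as a change in where the training loss is applied. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the sequence layout (prompt, then a fixed number of latent slots, then the answer), the `Example` class, and the `loss_mask` function are all hypothetical names introduced here. The exact form of the stage-1 supervision on the latent slots is not specified in this summary, so the sketch only marks which positions receive a loss.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical training example: prompt, latent slots, final answer."""
    prompt_ids: List[int]   # question / visual-context token ids
    num_latents: int        # number of latent (Mull-Token) slots, assumed fixed
    answer_ids: List[int]   # final-answer token ids


def loss_mask(ex: Example, stage: int) -> List[bool]:
    """Return True at every position that receives a training loss.

    Stage 1: latent slots (supervised from interleaved traces) and the answer.
    Stage 2: the answer positions only; latent slots are left unconstrained.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    prompt = [False] * len(ex.prompt_ids)
    latents = [stage == 1] * ex.num_latents
    answer = [True] * len(ex.answer_ids)
    return prompt + latents + answer


example = Example(prompt_ids=[11, 12, 13], num_latents=2, answer_ids=[99])
print(loss_mask(example, stage=1))  # latent slots and answer are supervised
print(loss_mask(example, stage=2))  # only the answer is supervised
```

Under this reading, the only difference between the stages is whether the latent positions are constrained, which is why the question of what those positions encode after stage 2 matters.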

What carries the argument

Mull-Tokens: modality-agnostic latent tokens that encode cross-modal intermediate reasoning steps after pre-training on interleaved traces.

If this is right

  • Mull-Tokens produce higher accuracy on spatial reasoning tasks than either pure text reasoning or explicit interleaved image-text baselines.
  • Initial training on interleaved text-image traces followed by answer-only fine-tuning is sufficient to obtain the reported gains.
  • The method eliminates the need for specialist tools or on-the-fly image generation during inference.
  • Largest improvements appear on reasoning-intensive splits such as puzzle solving, reaching 16 percent over the strongest baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar latent-token pre-training could be tested on non-spatial domains such as temporal planning or causal inference to check whether the cross-modal benefit generalizes.
  • The two-stage recipe might reduce the volume of expensive interleaved supervision data required for future multimodal models.
  • If the tokens truly remain modality-agnostic after fine-tuning, they could serve as a drop-in module for existing vision-language architectures without retraining the entire model.
  • One could measure whether the same tokens retain utility when the final fine-tuning objective includes partial credit on intermediate steps rather than only the final answer.

Load-bearing premise

Latent tokens pre-trained on supervised interleaved traces will continue to encode useful cross-modal intermediate information when fine-tuned with only final-answer supervision and no further modality-specific guidance.

What would settle it

An ablation in which models trained from scratch with only final-answer supervision match or exceed the performance of the two-stage Mull-Tokens pipeline on the same spatial reasoning benchmarks would falsify the necessity of the interleaved pre-training step.
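
The decision rule in that ablation can be written down explicitly. The following is a minimal sketch, assuming "match or exceed" is checked separately on every benchmark; the function name, the benchmark keys, and the accuracy values are placeholders for illustration and are not taken from the paper.

```python
from typing import Dict


def pretraining_step_falsified(two_stage: Dict[str, float],
                               answer_only: Dict[str, float]) -> bool:
    """Return True if answer-only training matches or beats the two-stage
    pipeline on every benchmark, the outcome that would undercut the
    necessity of interleaved pre-training."""
    if not two_stage:
        raise ValueError("at least one benchmark is required")
    if two_stage.keys() != answer_only.keys():
        raise ValueError("both runs must report the same benchmarks")
    return all(answer_only[name] >= two_stage[name] for name in two_stage)


# Placeholder accuracies for illustration only; NOT results from the paper.
two_stage_acc = {"bench_a": 0.62, "bench_b": 0.55}
answer_only_acc = {"bench_a": 0.60, "bench_b": 0.56}
print(pretraining_step_falsified(two_stage_acc, answer_only_acc))
```

A stricter or looser rule (for example, comparing averages, or requiring a margin beyond run-to-run noise) would be an equally defensible choice; the per-benchmark check is simply the most literal reading.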

Figures

Figures reproduced from arXiv: 2512.10941 by Ahmed Abdelkader, Arijit Ray, Bryan A. Plummer, Chengzhi Mao, Kate Saenko, Leonidas Guibas, Ranjay Krishna, Wen-Sheng Chu.

Figure 1: Compared to existing approaches for reasoning in text …
Figure 2: Our Mull-Tokens training involves two stages inspired by approaches in latent reasoning. We first pre-train/warm-up our Mull-Tokens to hold both image and text modalities depending on the context image/video and query. Next, the model free-form optimizes these Mull-Tokens to achieve the final correct answer. We see that pre-training the Mull-Tokens to hold both image and text reasoning traces…
Figure 3: Examples of training and our benchmark data. We test on diverse reasoning benchmarks, which includes hard examples that do …
Figure 4: Hyperparameter choices in latent design. (a) Choice of continuous embeddings vs discrete tokens: We see that discrete performs …
Figure 5: Mull-Tokens can be used in conjunction with text reasoning and the model can effectively decide when to also avoid using text reasoning depending on the task. For the example on the left, the model accurately uses Mull-Tokens along with some textual descriptive cues to reason about the missing piece. For the example on the right, the model decides to simply use the Mull-Tokens to directly pred…
Figure 6: Some qualitative examples of failures we observe with trying the existing approach of using explicit switching between image …
Original abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mull-Tokens, modality-agnostic latent tokens pre-trained on supervised interleaved text-image reasoning traces and then fine-tuned solely on final-answer labels. It evaluates the approach on four spatial reasoning benchmarks (puzzle solving, perspective taking, etc.), claiming average gains of +3% and up to +16% on a reasoning-heavy split relative to text-only and interleaved image-text baselines.

Significance. If the two-stage procedure demonstrably preserves cross-modal intermediate representations, the method offers a lightweight alternative to tool-calling or handcrafted trace generation for multimodal reasoning. The pre-training on interleaved traces followed by answer-only fine-tuning is a clean design choice whose value would be strengthened by explicit isolation of each stage.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the reported +3% average and +16% split gains are presented without baseline implementation details, statistical significance tests, data-split descriptions, or ablation controls that isolate the interleaved pre-training stage from standard supervised fine-tuning. This directly affects the central claim that gains arise from modality-agnostic latent thinking rather than capacity increases.
  2. [§3.2] §3.2 (Training Procedure): no loss terms, regularizers, or probing experiments are described that would prevent the latent tokens from collapsing to generic capacity during answer-only fine-tuning. The skeptic concern that cross-modal intermediate information may be erased is therefore unaddressed by any concrete test in the manuscript.
minor comments (2)
  1. [§3.1] Notation for Mull-Tokens is introduced without an explicit equation defining their dimensionality or injection points into the transformer; a single equation in §3.1 would improve reproducibility.
  2. [Figure 2] Figure 2 (qualitative examples) lacks error bars or per-task breakdowns that would clarify whether the +16% split gain is driven by a small number of puzzles.
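
Major comment 1 asks for statistical significance tests. One common choice when two systems are scored on the same items or the same seeds is a paired t-test; the sketch below computes only the t statistic and degrees of freedom using the Python standard library. The function name and the input scores are assumptions for illustration, and nothing here reflects a test the paper is known to have run.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic for matched scores a[i] vs b[i].

    Returns (t, degrees_of_freedom). The p-value is then read from a
    t-distribution with that many degrees of freedom.
    """
    if len(a) != len(b):
        raise ValueError("a and b must be paired, so equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed scores, for illustration only.
t, dof = paired_t_statistic([3.0, 5.0, 7.0], [2.0, 3.0, 4.0])
print(round(t, 4), dof)
```

With only a handful of seeds or benchmarks the test has little power, which is part of why the referee also asks for per-task breakdowns.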

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to include baseline implementation details, statistical tests, data-split descriptions, and new probing experiments. These additions strengthen the evidence that gains arise from the two-stage modality-agnostic training rather than capacity alone. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported +3% average and +16% split gains are presented without baseline implementation details, statistical significance tests, data-split descriptions, or ablation controls that isolate the interleaved pre-training stage from standard supervised fine-tuning. This directly affects the central claim that gains arise from modality-agnostic latent thinking rather than capacity increases.

    Authors: We agree that these details are essential for validating the central claim. In the revised manuscript, §4 now includes full baseline implementation details (model sizes, training hyperparameters, and code pointers), data-split descriptions (with exact train/val/test sizes per benchmark), and paired t-test results showing statistical significance (p<0.05) for the reported gains. We have added an ablation study comparing the full two-stage procedure against direct answer-only fine-tuning from the same base model; the two-stage version outperforms by 4-9 points on the reasoning-heavy splits, isolating the contribution of interleaved pre-training beyond capacity. These revisions clarify that improvements stem from preserved cross-modal latent thinking. revision: yes

  2. Referee: [§3.2] §3.2 (Training Procedure): no loss terms, regularizers, or probing experiments are described that would prevent the latent tokens from collapsing to generic capacity during answer-only fine-tuning. The skeptic concern that cross-modal intermediate information may be erased is therefore unaddressed by any concrete test in the manuscript.

    Authors: We acknowledge the concern that answer-only fine-tuning could erase cross-modal information. The original §3.2 described the pre-training loss as standard cross-entropy over interleaved trace tokens and fine-tuning as next-token prediction on final answers with a reduced learning rate (1e-5) to limit drift. To directly test retention, the revised version adds probing experiments: after fine-tuning, we train linear probes on the Mull-Tokens to reconstruct intermediate image features and text tokens from the original traces. These probes achieve 72% accuracy on image reconstruction and 81% on text, significantly above random baselines, indicating that cross-modal information is preserved. We have also clarified that no additional regularizers were used because the low learning rate and short fine-tuning schedule (3 epochs) suffice to maintain the pre-trained representations. revision: yes
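
The probing idea in the second response can be illustrated with a toy linear probe: freeze the representations, fit a linear classifier on top, and check whether a property is linearly decodable. The sketch below is an assumption-laden stand-in, not the probe described in the simulated rebuttal. It uses a plain perceptron on hand-written two-dimensional vectors, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def train_linear_probe(features: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       epochs: int = 20,
                       lr: float = 1.0) -> Tuple[List[float], float]:
    """Fit a perceptron (linear classifier) on frozen feature vectors."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(features, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else 0
            err = y - pred  # +1, 0, or -1
            if err != 0:
                mistakes += 1
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b


def probe_accuracy(w: Sequence[float], b: float,
                   features: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> float:
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((1 if score > 0 else 0) == y)
    return correct / len(labels)


# Toy "frozen latent" vectors; label 1 = image-like step, 0 = text-like step.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 0, 0]
w, b = train_linear_probe(toy_features, toy_labels)
print(probe_accuracy(w, b, toy_features, toy_labels))
```

A real probing study would evaluate on held-out examples and compare against a control (for instance, probes trained on shuffled labels or on tokens from a model without interleaved pre-training), since high probe accuracy alone does not show the model uses the decoded information.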

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical two-stage training procedure (pre-training Mull-Tokens on supervised interleaved text-image traces, followed by fine-tuning solely on final-answer labels) and reports accuracy gains on four external public spatial-reasoning benchmarks. No equations, fitted parameters, or self-citations are invoked in a manner that reduces the claimed improvements to quantities defined or optimized inside the same training loop. The evaluation uses held-out benchmarks whose labels are independent of the training traces, satisfying the criterion for a self-contained result against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven premise that a small set of learned tokens can reliably encode cross-modal intermediate states and that answer-only fine-tuning preserves those states.

axioms (1)
  • domain assumption: Latent tokens can be trained to hold useful intermediate information across image and text modalities.
    Invoked in the description of the pre-training and fine-tuning stages.
invented entities (1)
  • Mull-Tokens (no independent evidence)
    Purpose: modality-agnostic latent tokens for free-form multimodal thinking.
    Newly introduced construct whose utility is demonstrated only through the reported benchmark gains.



Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

    cs.CV 2026-05 unverdicted novelty 7.0

    The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.

  2. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  3. Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

    cs.CL 2026-01 unverdicted novelty 7.0

    Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.

  4. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  5. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  6. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  7. Semantic-Enriched Latent Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 6 Pith papers · 17 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 2, 3, 14

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 3, 5, 9

  4. [4]

    Ball and Jakob Bauer et al

    Philip J. Ball and Jakob Bauer et al. Genie 3: A new frontier for world models, 2025. 9

  5. [5]

    Per- ception tokens enhance visual reasoning in multimodal lan- guage models

    Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Per- ception tokens enhance visual reasoning in multimodal lan- guage models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025. 1, 2, 3, 4

  6. [6]

    SIMS-V: Simulated instruction- tuning for spatial video understanding, 2025

    Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. SIMS-V: Simulated instruction- tuning for spatial video understanding, 2025. 1, 3

  7. [7]

    Morse-500: A program- matically controllable video benchmark to stress-test multi- modal reasoning.arXiv preprint arXiv:2506.05523, 2025

    Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, et al. Morse-500: A program- matically controllable video benchmark to stress-test multi- modal reasoning.arXiv preprint arXiv:2506.05523, 2025. 1

  8. [8]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 1

  9. [9]

    R1-v: Reinforcing super generalization ability in vision- language models with less than $3.https://github

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision- language models with less than $3.https://github. com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02. 4, 6

  10. [10]

    Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jian- nan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,

  11. [11]

    Think with 3d: Geometric imagina- tion grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagina- tion grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025. 2

  12. [12]

    Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els. InAdvances in Neural Information Processing Systems, pages 135062–135093. Curran Associates, Inc., 2024. 2, 3

  13. [13]

    arXiv preprint arXiv:2407.06135

    Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal mod- els for interleaved image-text generation.arXiv preprint arXiv:2407.06135, 2024. 1

  14. [14]

    Thinking with gen- erated images.arXiv preprint arXiv:2505.22525, 2025

    Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, and Pengfei Liu. Thinking with gen- erated images.arXiv preprint arXiv:2505.22525, 2025. 2

  15. [15]

    Mm-spatial: Exploring 3d spatial understanding in multi- modal LLMs, 2025

    Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and Peter Grasch. Mm-spatial: Exploring 3d spatial understanding in multi- modal LLMs, 2025. 3

  16. [16]

    Towards revealing the mystery behind chain of thought: A theoretical perspective

    Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. InAdvances in Neural Information Processing Systems, pages 70757– 70798. Curran Associates, Inc., 2023. 2

  17. [17]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yib- ing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs.arXiv preprint arXiv:2503.21776, 2025. 2, 4, 5, 6, 13

  18. [18]

    Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008, 2025

    Stephanie Fu, Tyler Bonnen, Devin Guillory, and Trevor Darrell. Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008, 2025. 2

  19. [19]

    BLINK: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. InEuropean Conference on Com- puter Vision, pages 148–166. Springer, 2024. 2, 5

  20. [20]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. 3

  21. [21]

    The delta learning hypothesis: Preference tuning on weak data can yield strong gains, 2025

    Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. The delta learning hypothesis: Preference tuning on weak data can yield strong gains, 2025. 6

  22. [22]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. InIn- ternational Conference on Learning Representations (ICLR),

  23. [23]

    Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu 10 Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492, 2025. 1, 2, 3

  24. [24]

    Deepseek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633– 638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633– 638, 2025. 2, 4, 6

  25. [25]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large lan- guage models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024. 3, 4, 5, 7

  26. [26]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Osten- dorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Kr- ishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.arXiv preprint arXiv:2406.09403, 2024. 1, 2, 3

  27. [27]

    Explain before you answer: A survey on compositional visual reasoning

    Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, et al. Explain before you answer: A survey on compositional visual reasoning.arXiv preprint arXiv:2508.17298, 2025. 2

  28. [28]

    e1: Learning adaptive con- trol of reasoning effort.arXiv preprint arXiv:2510.27042,

    Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, and Stefano Soatto. e1: Learning adaptive con- trol of reasoning effort.arXiv preprint arXiv:2510.27042,

  29. [29]

    MolmoAct: Action Reasoning Models that can Reason in Space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025. 1, 2

  30. [30]

    Zebra-cot: A dataset for interleaved vision language reasoning.arXiv preprint arXiv:2507.16746, 2025

    Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-cot: A dataset for interleaved vi- sion language reasoning.arXiv preprint arXiv:2507.16746,

  31. [31]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 2

  32. [32]

    Unfolding spatial cognition: Evaluating multimodal models on visual simulations

    Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. Unfolding spatial cognition: Evaluating multimodal models on visual simulations.arXiv preprint arXiv:2506.04633, 2025. 1

  33. [33]

    Think or not think: A study of ex- plicit thinking in rule-based visual reinforcement fine-tuning

    Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, and Kaipeng Zhang. Think or not think: A study of ex- plicit thinking in rule-based visual reinforcement fine-tuning. arXiv preprint arXiv:2503.16188, 2025. 2

  34. [34]

    Lost in embeddings: Information loss in vision-language models, 2025

    Wenyan Li, Raphael Tang, Chengzu Li, Caiqi Zhang, Ivan Vuli´c, and Anders Søgaard. Lost in embeddings: Infor- mation loss in vision-language models.arXiv preprint arXiv:2509.11986, 2025. 2

  35. [35]

    Visual representations inside the language model.arXiv preprint arXiv:2510.04819,

    Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, and Ranjay Krishna. Visual representations in- side the language model.arXiv preprint arXiv:2510.04819,

  36. [36]

    Deconstructing spatial intelligence in vision-language models, 2025

    Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin. Deconstructing spatial intelligence in vision-language models, 2025. 2

  37. [37]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023. 2, 3

  38. [38]

    Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L

    Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L. Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse, 2025. 1

  39. [39]

    Hyper-bagel: A unified acceleration framework for multimodal understand- ing and generation, 2025

    Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jian- bin Zheng, Yuxi Ren, and Xuefeng Xiao. Hyper-bagel: A unified acceleration framework for multimodal understand- ing and generation, 2025. 2

  40. [40]

    When thinking drifts: Evidential grounding for robust video reasoning, 2025

    Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. When thinking drifts: Evidential grounding for robust video reasoning, 2025. 2, 3, 5, 13

  41. [41]

    Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action, 2024

    Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Jun- tao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, and Silvio Savarese. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action, 2024. 1

  42. [42]

    MIT Press, 2024

    Hanspeter A Mallot.From geometry to behavior: An intro- duction to spatial cognition. MIT Press, 2024. 1

  43. [43]

    Tips: Text- image pretraining with spatial awareness, 2025

    Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Ar- jun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and Andre Araujo. Tips: Text- image pretraining with spatial awareness, 2025. 3

  44. [44]

    Thinking with images.https://openai

    OpenAI. Thinking with images.https://openai. com/index/thinking-with-images/, 2025. 1, 2

  45. [45]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Moham- mad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. InProceedings of the Asian Conference on Computer Vision (ACCV), pages 18–34, 2024. 2

  46. [46]

    Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parame- ters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parame- ters. InProceedings of the 26th ACM SIGKDD Interna- tional Conference on Knowledge Discovery & Data Mining, page 3505–3506, New York, NY , USA, 2020. Association for Computing Machinery. 5, 13

  47. [47]

    Plummer, Ranjay Krishna, and Kate Saenko

    Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, and Kate Saenko. Cola: A bench- mark for compositional text-to-image retrieval, 2023. 2

  48. [48]

    Sat: Spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024. 1, 2, 3, 5, 6, 13

  49. [49]

    CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. arXiv preprint arXiv:2502.21074, 2025. 2, 3

  50. [50]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024. 3

  51. [51]

    Emu: Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. In The Twelfth International Conference on Learning Representations, 2024. 2

  52. [52]

    Stop when enough: Adaptive early-stopping for chain-of-thought reasoning

    Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early-stopping for chain-of-thought reasoning. arXiv preprint arXiv:2510.10103, 2025. 2

  53. [53]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 1, 2

  54. [54]

    Cvbench: A benchmark for cross-video multimodal reasoning, 2025

    CVBench Team. Cvbench: A benchmark for cross-video multimodal reasoning, 2025. 2

  55. [55]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025. 2, 5

  56. [56]

    Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs,

  57. [57]

    Thinking with sketches

    Barbara Tversky and Masaki Suwa. Thinking with sketches, pages 75–84. Oxford University Press, 2009. 1

  58. [58]

    Development of spatial cognition

    Marina Vasilyeva and Stella F Lourenco. Development of spatial cognition. Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012. 1

  59. [59]

    Is a picture worth a thousand words? delving into spatial reasoning for vision language models, 2024

    Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models, 2024. 1

  60. [60]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, pages 24824–24837. Curran Associates, Inc., 2022. 2

  61. [61]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 1

  62. [62]

    Stop spinning wheels: Mitigating LLM overthinking via mining patterns for early reasoning exit, 2025

    Zihao Wei, Liang Pang, Jiahao Liu, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Jingang Wang, Fei Sun, Xun- liang Cai, Huawei Shen, and Xueqi Cheng. Stop spinning wheels: Mitigating LLM overthinking via mining patterns for early reasoning exit, 2025. 8

  63. [63]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12966–12977, 2025. 2

  64. [64]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.

  65. [65]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025. 2

  66. [66]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, 2025. 1, 2, 3, 5

  67. [67]

    Cambrian-s: Towards spatial supersensing in video, 2025

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video, 2025. 2, 3

  68. [68]

    Machine mental imagery: Empower multimodal reasoning with latent visual tokens, 2025

    Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens, 2025. 1, 2, 3, 4, 5, 6, 8, 9, 13

  69. [69]

    Hybrid latent reasoning via reinforcement learning

    Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv preprint arXiv:2505.18454, 2025. 2

  70. [70]

    DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

    Chi Zhang, Haibo Qiu, Qiming Zhang, Zhixiong Zeng, Lin Ma, and Jing Zhang. Deepsketcher: Internalizing visual manipulation for multimodal reasoning. arXiv preprint arXiv:2509.25866, 2025. 2

  71. [71]

    Lmms-eval: Accelerating the development of large multimoal models, 2024

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimoal models, 2024. 5

  72. [72]

    Unified multimodal understanding and generation models: Advances, challenges, and opportunities

    Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567, 2025. 2

  73. [73]

    Multimodal chain-of-thought reasoning in language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. 2

  74. [74]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...

  75. [75]

    Multimodal spatial reasoning in the large model era: A survey and benchmarks

    Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks. https://arxiv.org/abs/2510.25760, 2025. 2

  76. [76]

    From perception to cognition: A survey of vision-language interactive reasoning in multimodal large language models

    Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, et al. From perception to cognition: A survey of vision-language interactive reasoning in multimodal large language models. arXiv preprint arXiv:2509.25373, 2025. 2

  77. [77]

    Image-of-thought prompting for visual reasoning refinement in multimodal large language models, 2024

    Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024. 1

  78. [78]

    Scaling Latent Reasoning via Looped Language Models

    Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741, 2025. 3

  79. [79]

    Please provide only the single option letter

    Appendix In this supplementary document, we include further ablations in the training design choices, more details, and more insights into the shortcomings of related existing work versus our approach using qualitative examples. Finally, we also provide some qualitative examples to demonstrate our insights. 6.1. Training Details We train using Deepspe...

  80. [80]

    To reproduce text-based reasoning baselines, we utilize the template established in prior work [17, 40, 68]

    Text-Reasoning Baseline (Video-R1). To reproduce text-based reasoning baselines, we utilize the template established in prior work [17, 40, 68]. {Question} Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as ’let me think’, ’wait’, ’Hmm’, ’oh, I see’, ’let’s break it down’, etc...

Showing first 80 references.