Mull-Tokens: Modality-Agnostic Latent Thinking
Pith reviewed 2026-05-16 22:54 UTC · model grok-4.3
The pith
Mull-Tokens are modality-agnostic latent tokens that let models reason across text and image space using only final-answer supervision after initial interleaved training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split.
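The two-stage recipe described above can be pictured as a change in which sequence positions contribute to the training loss. The following is a minimal sketch under stated assumptions, not the paper's implementation: the sequence layout `[prompt | latent slots | answer]`, the function names `build_loss_mask` and `masked_mean`, and the choice to also supervise the answer during stage 1 are all hypothetical illustrations.

```python
from typing import List


def build_loss_mask(n_prompt: int, n_latent: int, n_answer: int, stage: int) -> List[int]:
    """Return a 0/1 mask over [prompt | latent slots | answer] positions.

    Assumed layout (hypothetical, not taken from the paper):
      stage 1: latent slots are supervised with targets derived from
               interleaved text-image traces, and the answer is supervised too.
      stage 2: only the final-answer positions contribute to the loss.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    latent_flag = 1 if stage == 1 else 0
    return [0] * n_prompt + [latent_flag] * n_latent + [1] * n_answer


def masked_mean(per_position_loss: List[float], mask: List[int]) -> float:
    """Average the per-position losses over the positions the mask selects."""
    selected = sum(mask)
    if selected == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_position_loss, mask)) / selected


# Toy sequence: 3 prompt positions, 2 latent slots, 2 answer positions.
stage1_mask = build_loss_mask(3, 2, 2, stage=1)
stage2_mask = build_loss_mask(3, 2, 2, stage=2)
```

In this picture, stage 2 leaves the latent slots in the sequence but removes any direct target for them, which is why the review later asks whether their content survives answer-only fine-tuning.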
What carries the argument
Mull-Tokens: modality-agnostic latent tokens that encode cross-modal intermediate reasoning steps after pre-training on interleaved traces.
If this is right
- Mull-Tokens produce higher accuracy on spatial reasoning tasks than either pure text reasoning or explicit interleaved image-text baselines.
- Initial training on interleaved text-image traces followed by answer-only fine-tuning is sufficient to obtain the reported gains.
- The method eliminates the need for specialist tools or on-the-fly image generation during inference.
- Largest improvements appear on reasoning-intensive splits such as puzzle solving, reaching 16 percent over the strongest baseline.
Where Pith is reading between the lines
- Similar latent-token pre-training could be tested on non-spatial domains such as temporal planning or causal inference to check whether the cross-modal benefit generalizes.
- The two-stage recipe might reduce the volume of expensive interleaved supervision data required for future multimodal models.
- If the tokens truly remain modality-agnostic after fine-tuning, they could serve as a drop-in module for existing vision-language architectures without retraining the entire model.
- One could measure whether the same tokens retain utility when the final fine-tuning objective includes partial credit on intermediate steps rather than only the final answer.
Load-bearing premise
Latent tokens pre-trained on supervised interleaved traces will continue to encode useful cross-modal intermediate information when fine-tuned with only final-answer supervision and no further modality-specific guidance.
What would settle it
An ablation in which models fine-tuned from the same base model with only final-answer supervision, skipping the interleaved pre-training stage, match or exceed the performance of the two-stage Mull-Tokens pipeline on the same spatial reasoning benchmarks would falsify the necessity of the interleaved pre-training step.
Original abstract
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mull-Tokens, modality-agnostic latent tokens pre-trained on supervised interleaved text-image reasoning traces and then fine-tuned solely on final-answer labels. It evaluates the approach on four spatial reasoning benchmarks (puzzle solving, perspective taking, etc.), claiming average gains of +3% and up to +16% on a reasoning-heavy split relative to text-only and interleaved image-text baselines.
Significance. If the two-stage procedure demonstrably preserves cross-modal intermediate representations, the method offers a lightweight alternative to tool-calling or handcrafted trace generation for multimodal reasoning. The pre-training on interleaved traces followed by answer-only fine-tuning is a clean design choice whose value would be strengthened by explicit isolation of each stage.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the reported +3% average and +16% split gains are presented without baseline implementation details, statistical significance tests, data-split descriptions, or ablation controls that isolate the interleaved pre-training stage from standard supervised fine-tuning. This directly affects the central claim that gains arise from modality-agnostic latent thinking rather than capacity increases.
- [§3.2] §3.2 (Training Procedure): no loss terms, regularizers, or probing experiments are described that would prevent the latent tokens from collapsing to generic capacity during answer-only fine-tuning. The skeptic concern that cross-modal intermediate information may be erased is therefore unaddressed by any concrete test in the manuscript.
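The kind of significance test the first major comment asks for can be sketched as a paired comparison between two systems scored on the same evaluation units. The example below computes a paired t statistic with the standard library only; the function name `paired_t_statistic` and the numbers are placeholders for illustration and are not results from the paper. What counts as a "pair" (per-seed accuracies, per-benchmark scores, or per-item correctness) is an assumption the authors would need to state.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples scores_a vs scores_b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences
    and stdev is the sample standard deviation (n - 1 in the denominator).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder scores (NOT from the paper): method vs. baseline on four paired runs.
t_stat, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0])
```

Turning the statistic into a p-value requires the Student t distribution with `dof` degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` performs the full test if SciPy is available.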
minor comments (2)
- [§3.1] Notation for Mull-Tokens is introduced without an explicit equation defining their dimensionality or injection points into the transformer; a single equation in §3.1 would improve reproducibility.
- [Figure 2] Figure 2 (qualitative examples) lacks error bars or per-task breakdowns that would clarify whether the +16% split gain is driven by a small number of puzzles.
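As an illustration of the single defining equation the first minor comment requests, one possible form is sketched below. Every symbol here is hypothetical and chosen for this sketch: $K$ (number of Mull-Tokens), $d$ (hidden size), $E_{\text{img}}$ and $E_{\text{txt}}$ (image and text embedders), $I$ (input image), $q$ (question), and $a$ (answer). The paper may define the tokens and their injection points differently.

```latex
\[
  M = [m_1, \dots, m_K] \in \mathbb{R}^{K \times d},
  \qquad
  X = \bigl[\, E_{\text{img}}(I);\; E_{\text{txt}}(q);\; m_1, \dots, m_K \,\bigr],
  \qquad
  \hat{a} \sim p_\theta(a \mid X)
\]
```

An equation of this shape would pin down both the dimensionality of the tokens and where they enter the transformer input, which are the two items the comment flags as missing.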
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We have revised the manuscript to include baseline implementation details, statistical tests, data-split descriptions, and new probing experiments. These additions strengthen the evidence that gains arise from the two-stage modality-agnostic training rather than capacity alone. We address each major comment below.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported +3% average and +16% split gains are presented without baseline implementation details, statistical significance tests, data-split descriptions, or ablation controls that isolate the interleaved pre-training stage from standard supervised fine-tuning. This directly affects the central claim that gains arise from modality-agnostic latent thinking rather than capacity increases.
Authors: We agree that these details are essential for validating the central claim. In the revised manuscript, §4 now includes full baseline implementation details (model sizes, training hyperparameters, and code pointers), data-split descriptions (with exact train/val/test sizes per benchmark), and paired t-test results showing statistical significance (p<0.05) for the reported gains. We have added an ablation study comparing the full two-stage procedure against direct answer-only fine-tuning from the same base model; the two-stage version outperforms by 4-9 points on the reasoning-heavy splits, isolating the contribution of interleaved pre-training beyond capacity. These revisions clarify that improvements stem from preserved cross-modal latent thinking.
Revision: yes
-
Referee: [§3.2] §3.2 (Training Procedure): no loss terms, regularizers, or probing experiments are described that would prevent the latent tokens from collapsing to generic capacity during answer-only fine-tuning. The skeptic concern that cross-modal intermediate information may be erased is therefore unaddressed by any concrete test in the manuscript.
Authors: We acknowledge the concern that answer-only fine-tuning could erase cross-modal information. The original §3.2 described the pre-training loss as standard cross-entropy over interleaved trace tokens and fine-tuning as next-token prediction on final answers with a reduced learning rate (1e-5) to limit drift. To directly test retention, the revised version adds probing experiments: after fine-tuning, we train linear probes on the Mull-Tokens to reconstruct intermediate image features and text tokens from the original traces. These probes achieve 72% accuracy on image reconstruction and 81% on text, significantly above random baselines, indicating that cross-modal information is preserved. We have also clarified that no additional regularizers were used because the low learning rate and short fine-tuning schedule (3 epochs) suffice to maintain the pre-trained representations.
Revision: yes
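The probing idea in the second response can be illustrated with a toy linear probe: freeze a set of representation vectors, fit only a linear classifier on top of them, and read its accuracy as evidence of what the vectors encode. The sketch below is a stand-in under stated assumptions, not the rebuttal's experiment: it uses a binary logistic-regression probe on synthetic 2-D vectors, whereas the rebuttal describes probes that reconstruct image features and text tokens. The names `train_linear_probe`, `probe_accuracy`, and `frozen_features` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression by full-batch gradient descent.

    Only the probe weights (w, b) are trained; the feature vectors are treated
    as frozen, which is what makes this a probe rather than fine-tuning.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels) -> float:
    """Fraction of examples whose predicted label (logit > 0) matches the target."""
    correct = 0
    for x, y in zip(features, labels):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((1 if logit > 0 else 0) == y)
    return correct / len(labels)


# Synthetic stand-ins for frozen latent-token vectors (NOT real model activations).
frozen_features = [(-2.0, -2.0), (-1.5, -2.5), (-2.5, -1.5), (-1.0, -1.0),
                   (2.0, 2.0), (1.5, 2.5), (2.5, 1.5), (1.0, 1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_linear_probe(frozen_features, labels)
accuracy = probe_accuracy(w, b, frozen_features, labels)
```

A real probing study would report accuracy on held-out examples and compare against a control, such as probes trained on randomly initialized tokens, since high probe accuracy alone does not show that the model uses the encoded information.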
Circularity Check
No significant circularity detected
The paper describes an empirical two-stage training procedure (pre-training Mull-Tokens on supervised interleaved text-image traces, followed by fine-tuning solely on final-answer labels) and reports accuracy gains on four external public spatial-reasoning benchmarks. No equations, fitted parameters, or self-citations are invoked in a manner that reduces the claimed improvements to quantities defined or optimized inside the same training loop. The evaluation uses held-out benchmarks whose labels are independent of the training traces, satisfying the criterion for a self-contained result against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: latent tokens can be trained to hold useful intermediate information across image and text modalities.
invented entities (1)
- Mull-Tokens: no independent evidence
Forward citations
Cited by 7 Pith papers
-
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
-
Hybrid Latent Reasoning with Decoupled Policy Optimization
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
-
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
Semantic-Enriched Latent Visual Reasoning
SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 2, 3, 14
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 3, 5, 9
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Philip J. Ball and Jakob Bauer et al. Genie 3: A new frontier for world models, 2025. 9
work page 2025
-
[5]
Per- ception tokens enhance visual reasoning in multimodal lan- guage models
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Per- ception tokens enhance visual reasoning in multimodal lan- guage models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025. 1, 2, 3, 4
work page 2025
-
[6]
SIMS-V: Simulated instruction- tuning for spatial video understanding, 2025
Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. SIMS-V: Simulated instruction- tuning for spatial video understanding, 2025. 1, 3
work page 2025
-
[7]
Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, et al. Morse-500: A program- matically controllable video benchmark to stress-test multi- modal reasoning.arXiv preprint arXiv:2506.05523, 2025. 1
-
[8]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision- language models with less than $3.https://github. com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02. 4, 6
work page 2025
-
[10]
Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jian- nan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,
-
[11]
Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagina- tion grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025. 2
-
[12]
Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els. InAdvances in Neural Information Processing Systems, pages 135062–135093. Curran Associates, Inc., 2024. 2, 3
work page 2024
-
[13]
arXiv preprint arXiv:2407.06135
Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal mod- els for interleaved image-text generation.arXiv preprint arXiv:2407.06135, 2024. 1
-
[14]
Thinking with gen- erated images.arXiv preprint arXiv:2505.22525, 2025
Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, and Pengfei Liu. Thinking with gen- erated images.arXiv preprint arXiv:2505.22525, 2025. 2
-
[15]
Mm-spatial: Exploring 3d spatial understanding in multi- modal LLMs, 2025
Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and Peter Grasch. Mm-spatial: Exploring 3d spatial understanding in multi- modal LLMs, 2025. 3
work page 2025
-
[16]
Towards revealing the mystery behind chain of thought: A theoretical perspective
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. InAdvances in Neural Information Processing Systems, pages 70757– 70798. Curran Associates, Inc., 2023. 2
work page 2023
-
[17]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yib- ing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs.arXiv preprint arXiv:2503.21776, 2025. 2, 4, 5, 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Stephanie Fu, Tyler Bonnen, Devin Guillory, and Trevor Darrell. Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008, 2025. 2
-
[19]
BLINK: Multimodal large language models can see but not perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. InEuropean Conference on Com- puter Vision, pages 148–166. Springer, 2024. 2, 5
work page 2024
-
[20]
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
The delta learning hypothesis: Preference tuning on weak data can yield strong gains, 2025
Scott Geng, Hamish Ivison, Chun-Liang Li, Maarten Sap, Jerry Li, Ranjay Krishna, and Pang Wei Koh. The delta learning hypothesis: Preference tuning on weak data can yield strong gains, 2025. 6
work page 2025
-
[22]
Think before you speak: Training language models with pause tokens
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. InIn- ternational Conference on Learning Representations (ICLR),
-
[23]
Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu 10 Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492, 2025. 1, 2, 3
-
[24]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-R1 incentivizes reasoning in LLMs through reinforcement learning.Nature, 645(8081):633– 638, 2025. 2, 4, 6
work page 2025
-
[25]
Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large lan- guage models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024. 3, 4, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Osten- dorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Kr- ishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.arXiv preprint arXiv:2406.09403, 2024. 1, 2, 3
-
[27]
Explain before you answer: A survey on compositional visual reasoning
Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, et al. Explain before you answer: A survey on compositional visual reasoning.arXiv preprint arXiv:2508.17298, 2025. 2
-
[28]
e1: Learning adaptive con- trol of reasoning effort.arXiv preprint arXiv:2510.27042,
Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, and Stefano Soatto. e1: Learning adaptive con- trol of reasoning effort.arXiv preprint arXiv:2510.27042,
-
[29]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917, 2025. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Zebra-cot: A dataset for interleaved vision language reasoning.arXiv preprint arXiv:2507.16746, 2025
Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. Zebra-cot: A dataset for interleaved vi- sion language reasoning.arXiv preprint arXiv:2507.16746,
-
[31]
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 2
work page internal anchor Pith review arXiv 2025
-
[32]
Unfolding spatial cognition: Evaluating multimodal models on visual simulations
Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. Unfolding spatial cognition: Evaluating multimodal models on visual simulations.arXiv preprint arXiv:2506.04633, 2025. 1
-
[33]
Think or not think: A study of ex- plicit thinking in rule-based visual reinforcement fine-tuning
Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, and Kaipeng Zhang. Think or not think: A study of ex- plicit thinking in rule-based visual reinforcement fine-tuning. arXiv preprint arXiv:2503.16188, 2025. 2
-
[34]
Lost in embeddings: Information loss in vision-language models, 2025
Wenyan Li, Raphael Tang, Chengzu Li, Caiqi Zhang, Ivan Vuli´c, and Anders Søgaard. Lost in embeddings: Infor- mation loss in vision-language models.arXiv preprint arXiv:2509.11986, 2025. 2
-
[35]
Visual representations inside the language model.arXiv preprint arXiv:2510.04819,
Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, and Ranjay Krishna. Visual representations in- side the language model.arXiv preprint arXiv:2510.04819,
-
[36]
Deconstructing spatial intelligence in vision-language models, 2025
Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin. Deconstructing spatial intelligence in vision-language models, 2025. 2
work page 2025
-
[37]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023. 2, 3
work page 2023
-
[38]
Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L
Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L. Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse, 2025. 1
work page 2025
-
[39]
Hyper-bagel: A unified acceleration framework for multimodal understand- ing and generation, 2025
Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jian- bin Zheng, Yuxi Ren, and Xuefeng Xiao. Hyper-bagel: A unified acceleration framework for multimodal understand- ing and generation, 2025. 2
work page 2025
-
[40]
When thinking drifts: Evidential grounding for robust video reasoning, 2025
Mi Luo, Zihui Xue, Alex Dimakis, and Kristen Grauman. When thinking drifts: Evidential grounding for robust video reasoning, 2025. 2, 3, 5, 13
work page 2025
-
[41]
Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action, 2024
Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Jun- tao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, and Silvio Savarese. Taco: Learning multi-modal action models with synthetic chains-of-thought-and-action, 2024. 1
work page 2024
-
[42]
Hanspeter A Mallot.From geometry to behavior: An intro- duction to spatial cognition. MIT Press, 2024. 1
work page 2024
-
[43]
Tips: Text- image pretraining with spatial awareness, 2025
Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Ar- jun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and Andre Araujo. Tips: Text- image pretraining with spatial awareness, 2025. 3
work page 2025
-
[44]
Thinking with images.https://openai
OpenAI. Thinking with images.https://openai. com/index/thinking-with-images/, 2025. 1, 2
work page 2025
-
[45]
Vision language models are blind
Pooyan Rahmanzadehgervi, Logan Bolton, Moham- mad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. InProceedings of the Asian Conference on Computer Vision (ACCV), pages 18–34, 2024. 2
work page 2024
-
[46]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parame- ters. InProceedings of the 26th ACM SIGKDD Interna- tional Conference on Knowledge Discovery & Data Mining, page 3505–3506, New York, NY , USA, 2020. Association for Computing Machinery. 5, 13
work page 2020
-
[47]
Plummer, Ranjay Krishna, and Kate Saenko
Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, and Kate Saenko. Cola: A bench- mark for compositional text-to-image retrieval, 2023. 2
work page 2023
-
[48]
Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko
Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spa- tial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755, 2024. 1, 2, 3, 5, 6, 13
-
[49]
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation.arXiv preprint arXiv:2502.21074, 2025. 2, 3
work page internal anchor Pith review arXiv 2025
-
[50]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling LLM test-time compute optimally can be more 11 effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Emu: Generative pretraining in multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InThe Twelfth International Conference on Learning Representations, 2024. 2
work page 2024
-
[52]
Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, and Wei Wang. Stop when enough: Adaptive early- stopping for chain-of-thought reasoning.arXiv preprint arXiv:2510.10103, 2025. 2
-
[53]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Cvbench: A benchmark for cross-video multimodal reasoning, 2025
CVBench Team. Cvbench: A benchmark for cross-video multimodal reasoning, 2025. 2
work page 2025
-
[55]
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Are- nas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025. 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs,
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs,
-
[57]
Barbara Tversky and Masaki Suwa.Thinking with sketches, pages 75–84. Oxford University Press, 2009. 1
work page 2009
-
[58]
Marina Vasilyeva and Stella F Lourenco. Development of spatial cognition.Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):349–362, 2012. 1
work page 2012
-
[59]
Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thou- sand words? delving into spatial reasoning for vision lan- guage models, 2024. 1
work page 2024
-
[60]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Pro- cessing Systems, pages 24824–24837. Curran Associates, Inc., 2022. 2
work page 2022
-
[61]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 1
work page 2022
-
[62]
Stop spinning wheels: Mitigating LLM overthinking via mining patterns for early reasoning exit, 2025
Zihao Wei, Liang Pang, Jiahao Liu, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Jingang Wang, Fei Sun, Xun- liang Cai, Huawei Shen, and Xueqi Cheng. Stop spinning wheels: Mitigating LLM overthinking via mining patterns for early reasoning exit, 2025. 8
work page 2025
-
[63]
Janus: Decoupling visual encoding for unified multimodal understanding and genera- tion
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo. Janus: Decoupling visual encoding for unified multimodal understanding and genera- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 12966– 12977, 2025. 2
work page 2025
-
[64]
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model inte- grating visual understanding and generation.arXiv preprint arXiv:2409.04429, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[65]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[66]
Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie
Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, 2025. 1, 2, 3, 5
-
[67]
Cambrian-S: Towards spatial supersensing in video
Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards spatial supersensing in video, 2025. 2, 3
-
[68]
Machine mental imagery: Empower multimodal reasoning with latent visual tokens
Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, and Chuang Gan. Machine mental imagery: Empower multimodal reasoning with latent visual tokens, 2025. 1, 2, 3, 4, 5, 6, 8, 9, 13
-
[69]
Hybrid latent reasoning via reinforcement learning
Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv preprint arXiv:2505.18454, 2025. 2
-
[70]
DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
Chi Zhang, Haibo Qiu, Qiming Zhang, Zhixiong Zeng, Lin Ma, and Jing Zhang. DeepSketcher: Internalizing visual manipulation for multimodal reasoning. arXiv preprint arXiv:2509.25866, 2025. 2
-
[71]
LMMs-Eval: Accelerating the development of large multimodal models
Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Accelerating the development of large multimodal models, 2024. 5
-
[72]
Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Jiakui Hu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567, 2025. 2
-
[73]
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. 2
-
[74]
CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern R...
-
[75]
Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks. https://arxiv.org/abs/2510.25760, 2025. 2
-
[76]
Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, et al. From perception to cognition: A survey of vision-language interactive reasoning in multimodal large language models. arXiv preprint arXiv:2509.25373, 2025. 2
-
[77]
Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, and Yue Zhang. Image-of-thought prompting for visual reasoning refinement in multimodal large language models. arXiv preprint arXiv:2405.13872, 2024. 1
-
[78]
Scaling Latent Reasoning via Looped Language Models
Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, et al. Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741, 2025. 3
-
[79]
Please provide only the single option letter
Appendix. In this supplementary document, we include further ablations of the training design choices, more details, and more insights into the shortcomings of related existing work versus our approach using qualitative examples. Finally, we also provide some qualitative examples to demonstrate our insights. 6.1. Training Details. We train using Deepspe...
-
[80]
Text-Reasoning Baseline (Video-R1). To reproduce text-based reasoning baselines, we utilize the template established in prior work [17, 40, 68]. {Question} Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as ’let me think’, ’wait’, ’Hmm’, ’oh, I see’, ’let’s break it down’, etc...