How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Aishwarya Agrawal; Ankur Sikarwar; Huy Le; Le Zhang; Perouz Taslakian; Qian Yang; Zhuan Shi

arxiv: 2605.27310 · v1 · pith:I5Q2GXT6new · submitted 2026-05-26 · 💻 cs.CV

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

Qian Yang , Ankur Sikarwar , Huy Le , Le Zhang , Zhuan Shi , Perouz Taslakian , Aishwarya Agrawal This is my paper

Pith reviewed 2026-06-29 18:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords cross-view spatial reasoningvisual thinkingunified multimodal modelsView Dropoutpanoramic thinking imagesout-of-domain generalizationsynthetic training data

0 comments

The pith

Panoramic thinking images combined with View Dropout let unified multimodal models rely on generated visuals for cross-view spatial reasoning instead of language alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how to ensure that intermediate thinking images actually influence answers in unified multimodal models rather than being ignored. It introduces View Dropout, a training intervention that masks parts of an input view from the answer tokens while leaving them visible to the thinking-image tokens, pushing the model to consult the generated image. Among three rendering styles trained on synthetic scenes, only panoramic thinking images prove both informative enough to capture necessary geometry and learnable enough to produce accurate traces, delivering the strongest results on five real-world out-of-domain benchmarks.

Core claim

Panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.

What carries the argument

View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens.

If this is right

Training with VDrop prevents models from defaulting to language-only reasoning and forces use of the generated thinking image.
Panoramic renderings supply the right amount of geometric context without exceeding what the model can reliably produce during generation.
Top-down and point-matching renderings are either insufficiently informative or too difficult to generate accurately from the input views.
Synthetic scene training transfers to real-world tasks once the thinking-image type satisfies both learnability and informativeness.
The same training intervention and rendering comparison can be applied to other spatial reasoning benchmarks that require fine-grained geometry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The learnability-informativeness tradeoff may generalize to other intermediate visual representations beyond spatial reasoning.
Models might benefit from dynamically selecting the thinking-image style based on scene complexity rather than fixing one type.
Extending VDrop to hide varying fractions of the input view could reveal the minimal amount of masking needed to enforce visual reliance.
The approach suggests that future unified models could interleave multiple thinking images of different styles within a single reasoning trace.

Load-bearing premise

Forcing attention to the thinking image via View Dropout on synthetic data will produce genuine reliance on its visual evidence rather than new spurious correlations, and the synthetic-to-real gap will not invalidate the observed learnability-informativeness tradeoff.

What would settle it

A controlled attention-map comparison on held-out real scenes showing that the model still attends primarily to the original input views rather than the generated thinking image even after VDrop training, or a result where point-matching or top-down renderings outperform panoramic ones on the real benchmarks.

Figures

Figures reproduced from arXiv: 2605.27310 by Aishwarya Agrawal, Ankur Sikarwar, Huy Le, Le Zhang, Perouz Taslakian, Qian Yang, Zhuan Shi.

**Figure 1.** Figure 1: Visual thinking for cross-view spatial reasoning. Given two input views and a cross-view spatial question (left), a UMM can generate one of three intermediate thinking-image types (middle) before answering: panorama, point matching, or top-down. Right: without View Dropout, the answer pathway takes a shortcut through the input views, leaving the generated thinking-image unused; with View Dropout, part of o… view at source ↗

**Figure 2.** Figure 2: VDrop attention mask. Answer queries Qa cannot attend to the masked region (red hatched), while thinking-image queries Qvt retain full access to all. Recent analyses (Liu et al., 2025b) report that predictions remain nearly unchanged under visual intervention, indicating that the visual evidence in the thinking-image is largely ignored. Method overview. To force the thinking-image to be a load-bearing co… view at source ↗

**Figure 3.** Figure 3: Generate-then-blind probe across 4 OOD benchmarks. Accuracy drop when the generated thinking-image is blinded at answer time; a larger drop means more dependence on the thinking-image. VDroptrained models show larger drops on three benchmarks. out consulting it. A model that genuinely uses the thinking-image should lose accuracy under blinding; one that ignores it should be unaffected. We apply this prob… view at source ↗

**Figure 5.** Figure 5: Generate-then-blind probe on MMSI, by question evidence category. Accuracy drop when the generated thinking-image is blinded at answer time; a larger positive value means more dependence on the thinking-image. The VDrop-trained model shows a large drop only on Measurement, whose questions are answered by visually aligning the two input views, while standard SFT is unaffected throughout. tribute to the gene… view at source ↗

**Figure 6.** Figure 6: Mean answer-token attention on thinkingimage tokens across decoder layers (STARE). The VDrop-trained model places more attention on the generated thinking-image than the standard SFT model, especially in early and mid layers, indicating that VDrop shifts the answer pathway toward the thinking-image. into the decoding stream, after which it produces a thinking-image and then an answer. This makes the thi… view at source ↗

**Figure 7.** Figure 7: Qualitative examples of visual thinking across strategies. Four samples, one per subtask (Anchor, Counting, Relative Distance, Relative Direction). Each row shows the question and four options (gold option in green), followed by the two input camera views and the generated thinking-image under each strategy (Panoramic, Point Matching, Top-down View). The predicted answer letter and correctness (✓ / ✗) are … view at source ↗

read the original abstract

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

View Dropout pushes models to use generated thinking images for spatial reasoning and panoramic variants win on real benchmarks, but the mechanism may still allow shortcuts.

read the letter

The paper's core move is View Dropout: during training on synthetic scenes it masks regions of an input view from the final answer tokens while leaving them visible to the tokens that generate the thinking image. This is meant to stop the model from ignoring its own generated image. They then compare three thinking-image styles—top-down, panoramic, and point-matching—under a learnability-versus-informativeness framing and report that only panoramic plus VDrop both trains well and transfers to five real-world out-of-domain benchmarks.

The intervention itself is simple and directly targets a documented failure mode where VLMs generate thinking images but do not consult them. Framing the image-type comparison as a tradeoff is useful, and the decision to train only on synthetic data while testing on real scenes is a reasonable attempt at measuring generalization.

The main uncertainty is whether VDrop actually forces geometric use of the thinking image or merely creates new token-level correlations between the generated image and the answer. Because the masking is asymmetric and all training data is synthetic, it is possible the model learns domain-specific shortcuts that happen to survive the shift to real benchmarks. The abstract gives no ablations on the masking schedule, no statistical tests, and no analysis of what the model actually attends to, so it is hard to tell how much of the reported gain is genuine visual reasoning.

This is worth a serious referee for groups working on spatial reasoning in unified multimodal models. The idea is concrete enough that reviewers can check the mechanism and the OOD numbers directly. I would bring it to a reading group to see the full experimental details.

Referee Report

2 major / 2 minor

Summary. The paper claims that cross-view spatial reasoning in unified multimodal models (UMMs) can be improved by generating intermediate thinking images, but models often ignore the visual content. It introduces View Dropout (VDrop), a training intervention that hides input-view regions from the answer span while keeping them visible to thinking-image tokens, to encourage reliance on the generated image. It then compares three thinking-image variants (top-down, panoramic, point-matching) under a learnability-informativeness tradeoff, trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks. The central result is that only panoramic visual thinking combined with VDrop is both informative and learnable, yielding the best OOD generalization.

Significance. If the central result holds, the work supplies a practical training mechanism (VDrop) and an evaluation framework for making visual thinking effective rather than decorative in VLMs. The synthetic-to-real OOD setup and explicit tradeoff analysis are strengths that could guide future work on intermediate visual representations for spatial tasks. The finding that only one configuration succeeds provides a falsifiable prediction for follow-up studies.

major comments (2)

[Method section on View Dropout] Method section on View Dropout: the mechanism hides input regions only from the answer span while leaving them visible to thinking-image tokens. No ablation is described that tests whether performance drops when the thinking image is removed at inference (or when its content is corrupted), which is required to establish that the model is actually extracting geometric evidence rather than learning new token-answer shortcuts. This is load-bearing for the claim that VDrop produces genuine visual reliance and for the learnability-informativeness tradeoff.
[Results section reporting OOD benchmark performance] Results section reporting OOD benchmark performance: the claim that panoramic+VDrop is the only informative+learnable configuration and achieves best generalization rests on the assumption that the synthetic-to-real gap does not introduce domain-specific shortcuts. No analysis (e.g., feature attribution or controlled corruption of the thinking image on real benchmarks) is provided to rule out that the observed gains survive distribution shift for reasons other than visual reasoning.

minor comments (2)

[Introduction and Method] The definitions of 'informative' and 'learnable' are introduced in the abstract and method but would benefit from explicit operationalization (e.g., quantitative thresholds or equations) early in the paper for reproducibility.
[Figures] Figure captions for the thinking-image variants could include example renderings side-by-side with input views to make the differences between top-down, panoramic, and point-matching immediately clear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address the two major comments below and will incorporate additional experiments to strengthen the evidence for visual reliance and OOD generalization.

read point-by-point responses

Referee: [Method section on View Dropout] Method section on View Dropout: the mechanism hides input regions only from the answer span while leaving them visible to thinking-image tokens. No ablation is described that tests whether performance drops when the thinking image is removed at inference (or when its content is corrupted), which is required to establish that the model is actually extracting geometric evidence rather than learning new token-answer shortcuts. This is load-bearing for the claim that VDrop produces genuine visual reliance and for the learnability-informativeness tradeoff.

Authors: We agree that an inference-time ablation removing or corrupting the thinking image is needed to directly confirm reliance on its geometric content rather than token shortcuts. Although VDrop is explicitly designed to make thinking-image tokens the only source for hidden regions during training, the manuscript does not report such controls at inference. We will add these ablations (both removal and corruption) for all three thinking-image variants on both synthetic and real benchmarks in the revision. revision: yes
Referee: [Results section reporting OOD benchmark performance] Results section reporting OOD benchmark performance: the claim that panoramic+VDrop is the only informative+learnable configuration and achieves best generalization rests on the assumption that the synthetic-to-real gap does not introduce domain-specific shortcuts. No analysis (e.g., feature attribution or controlled corruption of the thinking image on real benchmarks) is provided to rule out that the observed gains survive distribution shift for reasons other than visual reasoning.

Authors: The five real-world OOD benchmarks already demonstrate consistent gains for panoramic+VDrop, which we interpret as evidence against purely synthetic shortcuts. We nevertheless acknowledge that explicit controls such as feature attribution or thinking-image corruption on the real benchmarks would further isolate visual reasoning as the source of improvement. We will add these analyses in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of training interventions on held-out benchmarks

full rationale

The paper introduces View Dropout as a training-time masking intervention and evaluates three thinking-image variants (top-down, panoramic, point-matching) by training unified multimodal models on synthetic scenes then measuring performance on five real-world out-of-domain benchmarks. The central claim—that panoramic+VDrop is the only informative+learnable configuration with best generalization—is presented as an observed experimental outcome rather than a quantity derived by definition or by fitting a parameter to the target metric. No equations, self-citations, or ansatzes are invoked that reduce the reported result to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5782 in / 1012 out tokens · 34573 ms · 2026-06-29T18:16:19.202038+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 13 canonical work pages · 7 internal anchors

[1]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, and 10 others. 2026. Scaling spatial intelligence with multimodal foundation models. In Proceedings of the IEEE/CVF Conferenc...

2026
[3]

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455--14465

2024
[4]

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. 2025 a . Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, and 1 others. 2025 b . Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632

work page arXiv 2025
[6]

Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, and 1 others. 2026. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought. Advances in Neural Information Processing Systems, 38:96084--96112

2026
[7]

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. 2025. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, and 1 others. 2026. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture. arXiv preprint arXiv:2605.12500

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148--166. Springer

2024
[10]

Simon Garrod and Anthony Anderson. 1987. Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27(2):181--218

1987
[11]

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. 2026. https://openreview.net/forum?id=mB3vxfrQZM Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning . In The Fourteenth International Conference on Learning Representations

2026
[12]

Leekyeung Han, Hyunji Min, Gyeom Hwangbo, Jonghyun Choi, and Paul Hongsuck Seo. 2025. Dialnav: Multi-turn dialog navigation with a remote guide. In IEEE/CVF International Conference on Computer Vision, ICCV 2025, Honolulu, HI, USA, October 19-25, 2025 , pages 8514--8523. IEEE

2025
[13]

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. 2024. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348--139379

2024
[14]

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, XinQiang Yu, Jiawei He, He Wang, and Li Yi. 2026. https://openreview.net/forum?id=6nZKT2rL0H Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models . In The Fourteenth International Conference on Learning Representations

2026
[15]

Stephen C Levinson. 2003. Space in language and cognition: Explorations in cognitive diversity, volume 5. Cambridge University Press

2003
[16]

Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. 2026 a . https://openreview.net/forum?id=c6XIVI3TiQ Zebra-cot: A dataset for interleaved vision-language reasoning . In The Fourteenth International Conference on Learning Representations

2026
[17]

Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. 2026 b . https://openreview.net/forum?id=fbGmSV6tUw Unfolding spatial cognition: Evaluating multimodal models on visual simulations . In The Fourteenth International Conference on Learning Representations

2026
[18]

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, and Yuren Cong. 2026. Tuna-2: Pixel embeddings beat vision encoders for unified understanding and generation. arXiv preprint arXiv:2604.24763

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, and 6 others. 2025 a . https://arxiv.org/abs/2512.02014 Tuna: Taming unified visual representations for ...

work page arXiv 2025
[20]

Zujing Liu, Junwen Pan, Qi She, Yuan Gao, and Guisong Xia. 2025 b . On the faithfulness of visual thinking: Measurement and enhancement. arXiv preprint arXiv:2510.23482

work page arXiv 2025
[21]

Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, Zeyu Ma, and Jia Deng. 2024. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783--21794

2024
[22]

Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, and Aishwarya Agrawal. 2026. Communicating about space: Language-mediated spatial integration across partial views. arXiv preprint arXiv:2603.27183

work page arXiv 2026
[23]

Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. 2025. Splattalk: 3d vqa with gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4712--4721

2025
[24]

Barbara Tversky. 2003. Structures of mental spaces: How people think about space. Environment and behavior, 35(1):66--80

2003
[25]

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. 2026. https://openreview.net/forum?id=0FhrtdKLtD Mindcube: Spatial mental modeling from limited views . In The Fourteenth International Conference on Learning...

2026
[26]

Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, and Xiaolong Zheng. 2025. https://arxiv.org/abs/2512.04686 Towards cross-view point correspondence in vision-language models . Preprint, arXiv:2512.04686

work page arXiv 2025
[27]

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and 1 others. 2025. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966--12977

2025
[28]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2026. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. Advances in neural information processing systems, 38:13569--13597

2026
[29]

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. 2025. Show-o: One single transformer to unify multimodal understanding and generation. In International Conference on Learning Representations, volume 2025, pages 28240--28264

2025
[30]

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli \'c . 2026. https://openreview.net/forum?id=wsnse46kRO Visual planning: Let's think only with images . In The Fourteenth International Conference on Learning Representations

2026
[31]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. 2026 a . https://openreview.net/forum?id=gHRoX4vXm3 MMSI -bench: A benchmark for multi-image spatial intelligence . In The Fourteenth International Conference on Learning Representations

2026
[32]

Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. 2026 b . Mindjourney: Test-time scaling with world models for spatial reasoning. Advances in Neural Information Processing Systems, 38:109855--109885

2026
[33]

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, and Mohit Bansal. 2026. When and how much to imagine: Adaptive test-time scaling with world models for visual spatial reasoning. arXiv preprint arXiv:2602.08236

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, and 1 others. 2026 a . From where things are to what they are for: Benchmarking spatial-functional intelligence in multimodal llms. arXiv preprint arXiv:2605.02130

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, and 1 others. 2026 b . Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029

work page arXiv 2026
[36]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[37]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[1] [1]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, and 10 others. 2026. Scaling spatial intelligence with multimodal foundation models. In Proceedings of the IEEE/CVF Conferenc...

2026

[3] [3]

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455--14465

2024

[4] [4]

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. 2025 a . Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, and 1 others. 2025 b . Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632

work page arXiv 2025

[6] [6]

Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, and 1 others. 2026. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought. Advances in Neural Information Processing Systems, 38:96084--96112

2026

[7] [7]

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. 2025. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, and 1 others. 2026. Sensenova-u1: Unifying multimodal understanding and generation with neo-unify architecture. arXiv preprint arXiv:2605.12500

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148--166. Springer

2024

[10] [10]

Simon Garrod and Anthony Anderson. 1987. Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27(2):181--218

1987

[11] [11]

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. 2026. https://openreview.net/forum?id=mB3vxfrQZM Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning . In The Fourteenth International Conference on Learning Representations

2026

[12] [12]

Leekyeung Han, Hyunji Min, Gyeom Hwangbo, Jonghyun Choi, and Paul Hongsuck Seo. 2025. Dialnav: Multi-turn dialog navigation with a remote guide. In IEEE/CVF International Conference on Computer Vision, ICCV 2025, Honolulu, HI, USA, October 19-25, 2025 , pages 8514--8523. IEEE

2025

[13] [13]

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. 2024. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348--139379

2024

[14] [14]

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, XinQiang Yu, Jiawei He, He Wang, and Li Yi. 2026. https://openreview.net/forum?id=6nZKT2rL0H Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models . In The Fourteenth International Conference on Learning Representations

2026

[15] [15]

Stephen C Levinson. 2003. Space in language and cognition: Explorations in cognitive diversity, volume 5. Cambridge University Press

2003

[16] [16]

Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. 2026 a . https://openreview.net/forum?id=c6XIVI3TiQ Zebra-cot: A dataset for interleaved vision-language reasoning . In The Fourteenth International Conference on Learning Representations

2026

[17] [17]

Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, and Ranjay Krishna. 2026 b . https://openreview.net/forum?id=fbGmSV6tUw Unfolding spatial cognition: Evaluating multimodal models on visual simulations . In The Fourteenth International Conference on Learning Representations

2026

[18] [18]

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, and Yuren Cong. 2026. Tuna-2: Pixel embeddings beat vision encoders for unified understanding and generation. arXiv preprint arXiv:2604.24763

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, and 6 others. 2025 a . https://arxiv.org/abs/2512.02014 Tuna: Taming unified visual representations for ...

work page arXiv 2025

[20] [20]

Zujing Liu, Junwen Pan, Qi She, Yuan Gao, and Guisong Xia. 2025 b . On the faithfulness of visual thinking: Measurement and enhancement. arXiv preprint arXiv:2510.23482

work page arXiv 2025

[21] [21]

Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, Zeyu Ma, and Jia Deng. 2024. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783--21794

2024

[22] [22]

Ankur Sikarwar, Debangan Mishra, Sudarshan Nikhil, Ponnurangam Kumaraguru, and Aishwarya Agrawal. 2026. Communicating about space: Language-mediated spatial integration across partial views. arXiv preprint arXiv:2603.27183

work page arXiv 2026

[23] [23]

Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. 2025. Splattalk: 3d vqa with gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4712--4721

2025

[24] [24]

Barbara Tversky. 2003. Structures of mental spaces: How people think about space. Environment and behavior, 35(1):66--80

2003

[25] [25]

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. 2026. https://openreview.net/forum?id=0FhrtdKLtD Mindcube: Spatial mental modeling from limited views . In The Fourteenth International Conference on Learning...

2026

[26] [26]

Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, Zhiyuan Ma, Daniel Dajun Zeng, and Xiaolong Zheng. 2025. https://arxiv.org/abs/2512.04686 Towards cross-view point correspondence in vision-language models . Preprint, arXiv:2512.04686

work page arXiv 2025

[27] [27]

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and 1 others. 2025. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966--12977

2025

[28] [28]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2026. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. Advances in neural information processing systems, 38:13569--13597

2026

[29] [29]

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. 2025. Show-o: One single transformer to unify multimodal understanding and generation. In International Conference on Learning Representations, volume 2025, pages 28240--28264

2025

[30] [30]

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli \'c . 2026. https://openreview.net/forum?id=wsnse46kRO Visual planning: Let's think only with images . In The Fourteenth International Conference on Learning Representations

2026

[31] [31]

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. 2026 a . https://openreview.net/forum?id=gHRoX4vXm3 MMSI -bench: A benchmark for multi-image spatial intelligence . In The Fourteenth International Conference on Learning Representations

2026

[32] [32]

Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. 2026 b . Mindjourney: Test-time scaling with world models for spatial reasoning. Advances in Neural Information Processing Systems, 38:109855--109885

2026

[33] [33]

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, and Mohit Bansal. 2026. When and how much to imagine: Adaptive test-time scaling with world models for visual spatial reasoning. arXiv preprint arXiv:2602.08236

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, and 1 others. 2026 a . From where things are to what they are for: Benchmarking spatial-functional intelligence in multimodal llms. arXiv preprint arXiv:2605.02130

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, and 1 others. 2026 b . Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029

work page arXiv 2026

[36] [36]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[37] [37]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...