pith. machine review for the scientific record.

arxiv: 2604.17385 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial reasoning · visual imagination · multimodal large language models · chain-of-thought reasoning · spatial intelligence · state tracking · data engine · closed-loop verification

The pith

Multimodal models achieve more reliable spatial reasoning by selectively generating visual images to track geometric states during text-based planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often fail at spatial tasks because text reasoning discards the precise geometric details needed to maintain consistent states across steps. SpatialImaginer counters this by splitting the process: text chain-of-thought handles high-level semantic plans while visual imagination performs the actual geometry updates and consistency checks. A difficulty-aware data engine with closed-loop verification teaches the model when to trigger imagination rather than relying on text alone. Experiments across spatial intelligence benchmarks confirm higher accuracy and greater robustness, especially on multi-step problems that require repeated state tracking.

Core claim

The paper establishes that effective spatial reasoning requires low-level geometric structure to be preserved and updated throughout the process, which text representations alone cannot achieve reliably. By adopting a divide-and-conquer framework that pairs textual planning with visual imagination for state transformations, and training this behavior via a targeted data engine, the model learns to invoke visual generation only when needed for stable tracking.

What carries the argument

The divide-and-conquer strategy that delegates high-level semantic planning to text chain-of-thought while reserving visual imagination for geometry-sensitive state transformation and consistency preservation.
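
To make the division of labor concrete, here is a minimal, hypothetical sketch of such an interleaved loop: a text planner proposes each step, a learned trigger decides whether the step touches geometry, and only then does the imagination module update the spatial state. Every name here (plan_step, needs_imagination, imagine) is an illustrative stand-in, not the paper's interface.

from dataclasses import dataclass, field

@dataclass
class SpatialState:
    """Toy spatial state: object name -> (x, y) position."""
    objects: dict = field(default_factory=dict)

def plan_step(question: str, step: int) -> str:
    """Stand-in for the text chain-of-thought planner."""
    return f"step {step}: move the cube one unit right"

def needs_imagination(plan: str) -> bool:
    """Stand-in for the learned trigger: imagine only when the planned
    step changes geometry. Here, a crude keyword heuristic."""
    return any(w in plan for w in ("move", "rotate", "stack"))

def imagine(state: SpatialState, plan: str) -> SpatialState:
    """Stand-in for visual imagination: apply the geometric update and
    return a new, consistent state (the paper generates an image here)."""
    x, y = state.objects["cube"]
    return SpatialState({**state.objects, "cube": (x + 1, y)})

def solve(question: str, n_steps: int = 3) -> SpatialState:
    state = SpatialState({"cube": (0, 0)})
    for step in range(n_steps):
        plan = plan_step(question, step)      # text CoT: semantic plan
        if needs_imagination(plan):           # selective invocation
            state = imagine(state, plan)      # geometry-sensitive update
    return state

print(solve("Where is the cube after three moves right?").objects)
# -> {'cube': (3, 0)}

The gate is where the selectivity claim lives: semantic steps stay in text, and the imagined state is consulted only where text abstraction would drop coordinates.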

If this is right

  • The model maintains geometric details that text abstractions normally lose, enabling longer and more stable reasoning chains.
  • Selective use of visual imagination occurs only on tasks requiring stable spatial state tracking, avoiding unnecessary computation on simpler problems.
  • Unified multimodal generation produces both text plans and visual state updates within a single framework.
  • Robustness gains appear most clearly on complex multi-step spatial tasks across diverse benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same selective-imagination pattern could be tested on other structured reasoning domains that require maintaining hidden states over multiple steps.
  • Integration with external simulators or 3D environments might further reduce errors in tasks where image generation alone is insufficient.
  • The approach implies that future models could learn to switch between representation modes based on detected instability in their own reasoning traces.
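
The last point above is concrete enough to sketch: one hedged way a model could notice instability in its own traces is self-consistency voting, escalating from text-only reasoning to visual imagination only when sampled answers disagree. This is an editorial illustration, not anything the paper implements.

from collections import Counter
import random

def text_only_answer(question: str, seed: int) -> str:
    """Stand-in for one stochastic text-only reasoning trace."""
    random.seed(seed)
    return random.choice(["left", "left", "right"])  # toy disagreement

def answer_with_mode_switch(question: str, k: int = 5,
                            threshold: float = 0.8) -> str:
    """Sample k traces; escalate to visual mode if they disagree."""
    votes = Counter(text_only_answer(question, seed=s) for s in range(k))
    top, count = votes.most_common(1)[0]
    if count / k >= threshold:
        return top                        # traces agree: text suffices
    return "<invoke visual imagination>"  # unstable: switch modes

print(answer_with_mode_switch("Is the mug left or right of the laptop?"))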

Load-bearing premise

The difficulty-aware data engine with closed-loop verification trains the model to invoke visual imagination selectively without introducing new errors into the overall reasoning trace.
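
A hedged sketch of what this premise demands of the engine, assuming the closed loop works by replaying a trace's claimed state updates in a ground-truth simulator and discarding traces that fail the replay. The state encoding and every function name are assumptions for illustration, not the paper's pipeline.

def simulate(initial: tuple, actions: list) -> tuple:
    """Ground-truth simulator: apply (dx, dy) moves to a position."""
    x, y = initial
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)

def generate_candidate(initial: tuple, actions: list) -> dict:
    """Stand-in for a model-generated trace; this one drops a step,
    so its claimed final state is wrong."""
    return {"initial": initial, "actions": actions,
            "claimed_final": simulate(initial, actions[:-1])}

def verify(trace: dict) -> bool:
    """Closed-loop check: does the claimed final state replay exactly?"""
    return simulate(trace["initial"], trace["actions"]) == trace["claimed_final"]

candidates = [
    {"initial": (0, 0), "actions": [(1, 0), (0, 1)], "claimed_final": (1, 1)},
    generate_candidate((0, 0), [(1, 0), (0, 1)]),
]
kept = [t for t in candidates if verify(t)]
print(f"{len(kept)} of {len(candidates)} traces pass verification")  # 1 of 2

Only traces that survive the replay enter training, which is exactly where the premise could fail: if the verifier checks the wrong properties, errors pass through unflagged.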

What would settle it

An ablation study that removes the visual imagination component while keeping the text chain-of-thought and data engine, then measures accuracy drops specifically on multi-step spatial reasoning benchmarks.
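
That ablation is mechanical enough to sketch. Assuming benchmark items labeled with the number of state-tracking steps they require, a harness like the following (entirely hypothetical, with toy lambdas standing in for the full and ablated models) would expose the accuracy-by-depth gap the test calls for.

def evaluate(model, benchmark):
    """Accuracy of model(question, steps) against gold, grouped by step count."""
    by_steps = {}
    for item in benchmark:
        ok = model(item["question"], item["steps"]) == item["answer"]
        correct, total = by_steps.get(item["steps"], (0, 0))
        by_steps[item["steps"]] = (correct + ok, total + 1)
    return {n: c / t for n, (c, t) in sorted(by_steps.items())}

# Toy stand-ins: the full model keeps tracking; the ablated one loses
# the spatial state after two steps.
full_model = lambda q, steps: steps
text_only = lambda q, steps: min(steps, 2)

benchmark = [{"question": f"q{i}", "steps": s, "answer": s}
             for i, s in enumerate([1, 2, 4, 4, 6, 8])]

print("full:   ", evaluate(full_model, benchmark))  # 1.0 at every depth
print("ablated:", evaluate(text_only, benchmark))   # collapses past 2 steps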

Figures

Figures reproduced from arXiv: 2604.17385 by Bin Zhu, Jingjing Chen, Shaoxiang Chen, Tianwen Qian, Yang Jiao, Yian Li, Yu-Gang Jiang.

Figure 1: Unlike text-only reasoning (middle), which often … [figure omitted]
Figure 2: Overview of the data engine for interleaved visual reasoning. The pipeline isolates complex spatial bottlenecks … [figure omitted]
Figure 3: Statistics of the curated dataset. Left: Hierarchical … [figure omitted]
Figure 4: The architecture and reasoning paradigms of SpatialImaginer. The model employs a unified framework (bottom) that … [figure omitted]
Original abstract

Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal large language models (MLLMs) often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and the visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpatialImaginer, a unified multimodal generation framework for spatial reasoning in MLLMs. It integrates textual chain-of-thought reasoning for high-level semantic planning with visual imagination for geometry-sensitive state transformation and consistency preservation via a divide-and-conquer strategy. A difficulty-aware data engine with closed-loop verification is introduced to train selective invocation of visual imagination for stable spatial state tracking. The authors claim that extensive experiments on diverse spatial intelligence benchmarks demonstrate state-of-the-art performance and substantially improved robustness on complex multi-step tasks.

Significance. If the central claims hold, the work could meaningfully advance multimodal spatial reasoning by addressing the abstraction gap in text-only traces. The selective visual imagination mechanism offers a plausible adaptive strategy for tasks requiring geometric fidelity, and the data engine concept provides a training paradigm worth exploring in related multimodal settings.

major comments (2)
  1. [Difficulty-aware data engine with closed-loop verification] In the description of the difficulty-aware data engine and closed-loop verification, the manuscript does not demonstrate that the verification explicitly measures preservation of low-level geometric invariants (object positions, orientations, containment relations) across multi-step updates, as opposed to high-level semantic consistency or downstream task accuracy. This is load-bearing for the claim that the engine trains selective invocation without injecting new inconsistencies into visual imagination outputs. (A hedged code sketch of such an invariant check follows the minor comments.)
  2. [Experimental evaluation] The experimental results section asserts SOTA performance and robustness gains but supplies no quantitative metrics, baseline comparisons, ablation studies isolating the visual imagination component, or error analysis on multi-step traces. Without these, the central empirical claim cannot be evaluated.
minor comments (2)
  1. [Abstract] The abstract would benefit from naming the specific benchmarks and reporting key quantitative improvements to ground the SOTA claim.
  2. [Method] Notation and component names (e.g., for the visual imagination module and verification loop) should be introduced with explicit definitions or pseudocode for clarity.
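
For concreteness, the first major comment's request can be phrased as code: compare low-level geometric properties between consecutive imagined states instead of scoring only final answers. The per-object encoding (position, yaw, containment) and tolerances below are assumptions for illustration, not the paper's representation.

import math

def position_ok(prev, curr, moved, tol=0.05):
    """Objects not named in `moved` must keep their positions."""
    return all(math.dist(prev[o]["pos"], curr[o]["pos"]) <= tol
               for o in prev if o not in moved)

def orientation_ok(prev, curr, rotated, tol_deg=5.0):
    """Objects not named in `rotated` must keep their yaw angle."""
    return all(abs(prev[o]["yaw"] - curr[o]["yaw"]) <= tol_deg
               for o in prev if o not in rotated)

def containment_ok(prev, curr, moved):
    """Containment relations persist unless the object was moved."""
    return all(prev[o]["inside"] == curr[o]["inside"]
               for o in prev if o not in moved)

prev = {"mug":  {"pos": (0.0, 0.0),  "yaw": 0.0,  "inside": "box"},
        "book": {"pos": (1.0, 1.0),  "yaw": 90.0, "inside": None}}
curr = {"mug":  {"pos": (0.0, 0.01), "yaw": 0.0,  "inside": "box"},
        "book": {"pos": (2.0, 1.0),  "yaw": 90.0, "inside": None}}

moved, rotated = {"book"}, set()
print(position_ok(prev, curr, moved),
      orientation_ok(prev, curr, rotated),
      containment_ok(prev, curr, moved))  # True True True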

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below and commit to revisions that strengthen the presentation of the data engine and experimental results without altering the core contributions.

Point-by-point responses
  1. Referee: In the description of the difficulty-aware data engine and closed-loop verification, the manuscript does not demonstrate that the verification explicitly measures preservation of low-level geometric invariants (object positions, orientations, containment relations) across multi-step updates, as opposed to high-level semantic consistency or downstream task accuracy. This is load-bearing for the claim that the engine trains selective invocation without injecting new inconsistencies in visual imagination outputs.

    Authors: We acknowledge that the current manuscript description emphasizes high-level semantic consistency and downstream task accuracy in the closed-loop verification. The engine does incorporate geometric checks via rendered state comparisons in the data generation pipeline, but these are not explicitly quantified as low-level invariant preservation metrics. To address this, we will revise the relevant section to include explicit verification of object positions, orientations, and containment relations across multi-step updates, supported by additional quantitative analysis of invariant preservation rates. This will better substantiate the selective invocation claim. revision: yes

  2. Referee: The experimental results section asserts SOTA performance and robustness gains but supplies no quantitative metrics, baseline comparisons, ablation studies isolating the visual imagination component, or error analysis on multi-step traces. Without these, the central empirical claim cannot be evaluated.

    Authors: We agree that the experimental section requires more detailed substantiation. While the manuscript reports overall benchmark results supporting the SOTA and robustness claims, it does not sufficiently present the requested quantitative breakdowns, full baseline tables, ablations focused on the visual imagination module, or multi-step error analysis. We will expand this section in the revision with additional tables, ablation studies, and error categorization to allow proper evaluation of the contributions. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework with independent benchmark validation

Full rationale

The paper describes a proposed framework (SpatialImaginer) using a divide-and-conquer strategy and a difficulty-aware data engine, supported by experimental results on spatial intelligence benchmarks. No equations, derivations, or first-principles reductions are present in the abstract or described text. Claims of SOTA performance and improved robustness are tied to external evaluations rather than self-referential definitions, fitted parameters renamed as predictions, or self-citation chains. The verification loop is presented as a training component without reducing to tautological inputs by construction. This is a standard empirical proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption about limitations of text-only reasoning in MLLMs and the effectiveness of the proposed visual imagination integration, which is not independently verified in the abstract.

axioms (1)
  • domain assumption · Textual representations in MLLMs abstract away critical low-level geometric details necessary for consistent spatial reasoning.
    Core argument in the abstract for why current MLLMs fail on spatial tasks.
invented entities (1)
  • Visual imagination mechanism · no independent evidence
    purpose: To preserve and update low-level geometric structure during reasoning
    Introduced as part of the framework to address the mismatch between text reasoning and spatial needs.

pith-pipeline@v0.9.0 · 5525 in / 1196 out tokens · 33889 ms · 2026-05-10T06:27:49.660262+00:00 · methodology

