arxiv: 2604.17385 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

Yian Li , Yang Jiao , Bin Zhu , Tianwen Qian , Shaoxiang Chen , Jingjing Chen , Yu-Gang Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords spatial reasoningvisual imaginationmultimodal large language modelschain-of-thought reasoningspatial intelligencestate trackingdata engineclosed-loop verification

0 comments

The pith

Multimodal models achieve more reliable spatial reasoning by selectively generating visual images to track geometric states during text-based planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often fail at spatial tasks because text reasoning discards the precise geometric details needed to maintain consistent states across steps. SpatialImaginer counters this by splitting the process: text chain-of-thought handles high-level semantic plans while visual imagination performs the actual geometry updates and consistency checks. A difficulty-aware data engine with closed-loop verification teaches the model when to trigger imagination rather than relying on text alone. Experiments across spatial intelligence benchmarks confirm higher accuracy and greater robustness, especially on multi-step problems that require repeated state tracking.

Core claim

The paper establishes that effective spatial reasoning requires low-level geometric structure to be preserved and updated throughout the process, which text representations alone cannot achieve reliably. By adopting a divide-and-conquer framework that pairs textual planning with visual imagination for state transformations, and training this behavior via a targeted data engine, the model learns to invoke visual generation only when needed for stable tracking.

What carries the argument

The divide-and-conquer strategy that delegates high-level semantic planning to text chain-of-thought while reserving visual imagination for geometry-sensitive state transformation and consistency preservation.

If this is right

The model maintains geometric details that text abstractions normally lose, enabling longer and more stable reasoning chains.
Selective use of visual imagination occurs only on tasks requiring stable spatial state tracking, avoiding unnecessary computation on simpler problems.
Unified multimodal generation produces both text plans and visual state updates within a single framework.
Robustness gains appear most clearly on complex multi-step spatial tasks across diverse benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-imagination pattern could be tested on other structured reasoning domains that require maintaining hidden states over multiple steps.
Integration with external simulators or 3D environments might further reduce errors in tasks where pure generation of images is insufficient.
The approach implies that future models could learn to switch between representation modes based on detected instability in their own reasoning traces.

Load-bearing premise

The difficulty-aware data engine with closed-loop verification trains the model to invoke visual imagination selectively without introducing new errors into the overall reasoning trace.

What would settle it

An ablation study that removes the visual imagination component while keeping the text chain-of-thought and data engine, then measures accuracy drops specifically on multi-step spatial reasoning benchmarks.

Figures

Figures reproduced from arXiv: 2604.17385 by Bin Zhu, Jingjing Chen, Shaoxiang Chen, Tianwen Qian, Yang Jiao, Yian Li, Yu-Gang Jiang.

**Figure 2.** Figure 2: Overview of the data engine for interleaved visual reasoning. The pipeline isolates complex spatial bottlenecks [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Statistics of the curated dataset. Left: Hierarchical [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: The architecture and reasoning paradigms of SpatialImaginer. The model employs a unified framework (bottom) that [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal large language models (MLLMs) often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and the visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpatialImaginer adds selective visual imagination to MLLM reasoning to preserve spatial states, but the closed-loop verification may not fully guarantee geometric fidelity.

read the letter

The core idea here is straightforward: MLLMs lose spatial details in text-only chains, so the model should generate images on demand to track positions and relations during multi-step tasks. The paper frames this as a divide-and-conquer setup where text handles high-level planning and visual imagination handles geometry-sensitive updates, trained via a difficulty-aware data engine with closed-loop checks. That selective invocation mechanism is the clearest new piece compared to standard visual reasoning or CoT extensions. It directly targets a known failure mode in spatial intelligence benchmarks, and the motivation section does a clean job explaining why text abstractions drop critical low-level structure. The claimed SOTA results and robustness gains on complex tasks are presented as coming from extensive experiments, which at least shows the authors ran the necessary comparisons. The soft spot is exactly the one the stress-test flags. The verification loop needs to confirm that generated images maintain actual geometric invariants across steps, such as object positions, orientations, or containment, rather than just semantic consistency or end-task accuracy. If the checks are downstream-only, the approach could still produce fragile traces on harder cases, and the paper should include targeted metrics or error breakdowns on spatial fidelity to close that gap. Minor additional concerns include whether image generation variability affects reproducibility, but that is secondary if ablations are reported. This work is aimed at researchers building multimodal models for embodied AI or robotics who already work on spatial reasoning. A reader focused on MLLM extensions or data synthesis pipelines would extract usable ideas from the framework and engine. It deserves serious referee time because the problem is real, the proposal is concrete, and the experiments are described as broad enough to evaluate. I would send it for review with a request for explicit verification details and ablations on the geometric checks.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpatialImaginer, a unified multimodal generation framework for spatial reasoning in MLLMs. It integrates textual chain-of-thought reasoning for high-level semantic planning with visual imagination for geometry-sensitive state transformation and consistency preservation via a divide-and-conquer strategy. A difficulty-aware data engine with closed-loop verification is introduced to train selective invocation of visual imagination for stable spatial state tracking. The authors claim that extensive experiments on diverse spatial intelligence benchmarks demonstrate state-of-the-art performance and substantially improved robustness on complex multi-step tasks.

Significance. If the central claims hold, the work could meaningfully advance multimodal spatial reasoning by addressing the abstraction gap in text-only traces. The selective visual imagination mechanism offers a plausible adaptive strategy for tasks requiring geometric fidelity, and the data engine concept provides a training paradigm worth exploring in related multimodal settings.

major comments (2)

[Difficulty-aware data engine with closed-loop verification] In the description of the difficulty-aware data engine and closed-loop verification, the manuscript does not demonstrate that the verification explicitly measures preservation of low-level geometric invariants (object positions, orientations, containment relations) across multi-step updates, as opposed to high-level semantic consistency or downstream task accuracy. This is load-bearing for the claim that the engine trains selective invocation without injecting new inconsistencies in visual imagination outputs.
[Experimental evaluation] The experimental results section asserts SOTA performance and robustness gains but supplies no quantitative metrics, baseline comparisons, ablation studies isolating the visual imagination component, or error analysis on multi-step traces. Without these, the central empirical claim cannot be evaluated.

minor comments (2)

[Abstract] The abstract would benefit from naming the specific benchmarks and reporting key quantitative improvements to ground the SOTA claim.
[Method] Notation and component names (e.g., for the visual imagination module and verification loop) should be introduced with explicit definitions or pseudocode for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major point below and commit to revisions that strengthen the presentation of the data engine and experimental results without altering the core contributions.

read point-by-point responses

Referee: In the description of the difficulty-aware data engine and closed-loop verification, the manuscript does not demonstrate that the verification explicitly measures preservation of low-level geometric invariants (object positions, orientations, containment relations) across multi-step updates, as opposed to high-level semantic consistency or downstream task accuracy. This is load-bearing for the claim that the engine trains selective invocation without injecting new inconsistencies in visual imagination outputs.

Authors: We acknowledge that the current manuscript description emphasizes high-level semantic consistency and downstream task accuracy in the closed-loop verification. The engine does incorporate geometric checks via rendered state comparisons in the data generation pipeline, but these are not explicitly quantified as low-level invariant preservation metrics. To address this, we will revise the relevant section to include explicit verification of object positions, orientations, and containment relations across multi-step updates, supported by additional quantitative analysis of invariant preservation rates. This will better substantiate the selective invocation claim. revision: yes
Referee: The experimental results section asserts SOTA performance and robustness gains but supplies no quantitative metrics, baseline comparisons, ablation studies isolating the visual imagination component, or error analysis on multi-step traces. Without these, the central empirical claim cannot be evaluated.

Authors: We agree that the experimental section requires more detailed substantiation. While the manuscript reports overall benchmark results supporting the SOTA and robustness claims, it does not sufficiently present the requested quantitative breakdowns, full baseline tables, ablations focused on the visual imagination module, or multi-step error analysis. We will expand this section in the revision with additional tables, ablation studies, and error categorization to allow proper evaluation of the contributions. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework with independent benchmark validation

full rationale

The paper describes a proposed framework (SpatialImaginer) using a divide-and-conquer strategy and a difficulty-aware data engine, supported by experimental results on spatial intelligence benchmarks. No equations, derivations, or first-principles reductions are present in the abstract or described text. Claims of SOTA performance and improved robustness are tied to external evaluations rather than self-referential definitions, fitted parameters renamed as predictions, or self-citation chains. The verification loop is presented as a training component without reducing to tautological inputs by construction. This is a standard empirical proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption about limitations of text-only reasoning in MLLMs and the effectiveness of the proposed visual imagination integration, which is not independently verified in the abstract.

axioms (1)

domain assumption Textual representations in MLLMs abstract away critical low-level geometric details necessary for consistent spatial reasoning.
Core argument in the abstract for why current MLLMs fail on spatial tasks.

invented entities (1)

Visual imagination mechanism no independent evidence
purpose: To preserve and update low-level geometric structure during reasoning
Introduced as part of the framework to address the mismatch between text reasoning and spatial needs.

pith-pipeline@v0.9.0 · 5525 in / 1196 out tokens · 33889 ms · 2026-05-10T06:27:49.660262+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 30 canonical work pages · 14 internal anchors

[1]

Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 2016. 3D Semantic Parsing of Large-Scale Indoor Spaces. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

2016
[2]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shul- man. 2021. ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data. InAdvances in Neural Information Processing Systems (NeurIPS)

2021
[5]

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. 2025. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719 (2025)

work page arXiv 2025
[6]

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. 2025. Holistic Evaluation of Multimodal LLMs on Spatial Intelligence.arXiv preprint arXiv:2508...

work page arXiv 2025
[7]

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition. 14455–14465

2024
[8]

QwenLM Team (Alibaba Cloud). 2025. Qwen3-VL: Multimodal large language model series. https://github.com/QwenLM/Qwen3-VL. GitHub repository; accessed: 2025-11-14

2025
[9]

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

2017
[10]

Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alabdulmohsin, et al. 2023. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems36 (2023), 2252–2274

2023
[11]

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi
[12]

In Advances in Neural Information Processing Systems (NeurIPS)

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In Advances in Neural Information Processing Systems (NeurIPS)
[13]

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan
[14]

Emerging Properties in Unified Multimodal Pretraining.arXiv preprint arXiv:2505.14683(2025)

work page internal anchor Pith review arXiv 2025
[15]

Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. 2024. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 346–355

2024
[16]

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. 2025. Vlm-3r: Vision- language models augmented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision. Springer, 148–166

2024
[18]

Gemini. 2025. Gemini 3 Pro Model Card. Technical report, Gemini. Accessed: 2025-11-18

2025
[19]

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. 2025. Thinkmorph: Emergent prop- erties in multimodal interleaved chain-of-thought reasoning.arXiv preprint arXiv:2510.27492(2025)

work page arXiv 2025
[20]

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. 2025. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006(2025)

work page internal anchor Pith review arXiv 2025
[21]

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card.arXiv preprint arXiv:2412.16720(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. 2025. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. InProceedings of the Computer Vision and Pattern Recognition Conference. 3600–3610

2025
[23]

Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux

2024
[24]

Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, et al. 2025. Zebra-cot: A dataset for interleaved vision language reasoning.arXiv preprint arXiv:2507.16746(2025)

work page arXiv 2025
[25]

Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. 2025. Imagine while Reasoning in Space: Multimodal Visualization-of-Thought.arXiv preprint arXiv:2501.07542(2025)

work page arXiv 2025
[26]

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. 2025. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision- language models.arXiv preprint arXiv:2505.21500(2025)

work page arXiv 2025
[27]

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, and Xiaodan Liang. 2026. Thinking with Geometry: Active Geometry Integration for Spatial Reasoning.arXiv preprint arXiv:2602.06037(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. 2025. Spatialladder: Progressive training for spatial reasoning in vision-language models.arXiv preprint arXiv:2510.08531(2025)

work page arXiv 2025
[29]

Xinghan Li, Junhao Xu, and Jingjing Chen. 2026. VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection.arXiv preprint arXiv:2603.21526 (2026)

work page arXiv 2026
[30]

Yian Li, Wentao Tian, Yang Jiao, Tianwen Qian, Na Zhao, Bin Zhu, Jingjing Chen, and Yu-Gang Jiang. 2025. Look before you decide: Prompting active deduction of mllms for assumptive reasoning. InProceedings of the 33rd ACM International Conference on Multimedia. 2713–2722

2025
[31]

Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang. 2026. Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation.arXiv preprint arXiv:2602.15862(2026)

work page arXiv 2026
[32]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

2023
[33]

Xingchao Liu, Chengyue Gong, and Qiang Liu. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003(2022). Conference’17, July 2017, Washington, DC, USA Yian Li et al

work page internal anchor Pith review arXiv 2022
[34]

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 2025. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision. 6924–6934

2025
[36]

OpenAI. 2025. GPT-5 System Card. Technical report, OpenAI. Accessed: 2025-08-10

2025
[37]

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. 2025. SpaceR: Reinforcing MLLMs in Video Spatial Reasoning. arXiv preprint arXiv:2504.01805(2025)

work page arXiv 2025
[38]

Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. 2023. Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Percep- tion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2023
[39]

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. 2021. Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2021
[40]

Shenna Shepard and Douglas Metzler. 1988. Mental rotation: effects of dimen- sionality of objects and type of task.Journal of experimental psychology: Human perception and performance14, 1 (1988), 3

1988
[41]

ByteDance Seed Team. 2025. Seed1.5-VL Technical Report.arXiv preprint arXiv:2505.07062(2025)

work page internal anchor Pith review arXiv 2025
[42]

Chameleon Team. 2024. Chameleon: Mixed-modal early-fusion foundation mod- els.arXiv preprint arXiv:2405.09818(2024)

work page internal anchor Pith review arXiv 2024
[43]

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. 2025. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786(2025)

work page internal anchor Pith review arXiv 2025
[44]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. 2025. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition,. 5294–5306

2025
[45]

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025. Internvl3. 5: Ad- vancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265(2025)

work page internal anchor Pith review arXiv 2025
[46]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems35 (2022), 24824–24837

2022
[47]

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747(2025)

work page arXiv 2025
[48]

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. 2025. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965 (2025)

work page arXiv 2025
[49]

Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. 2024. Mind’s eye of LLMs: visualization-of-thought elicits spatial reasoning in large language models.Advances in Neural Information Processing Systems37 (2024), 90277–90317

2024
[50]

2025.Grok 4

xAI. 2025.Grok 4. https://x.ai/news/grok-4 Model announcement

2025
[51]

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie
[52]

InProceedings of the Computer Vision and Pattern Recognition Conference

Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference. 10632–10643
[53]

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. 2025. Visual Spatial Tuning.arXiv preprint arXiv:2511.05491(2025)

work page arXiv 2025
[54]

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. 2025. Cambrian-s: Towards spatial supersensing in video.arXiv preprint arXiv:2511.04670(2025)

work page arXiv 2025
[55]

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. 2023. ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

2023
[56]

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al
[57]

InProceedings of the IEEE/CVF International Conference on Computer Vision Workshop

Spatial mental modeling from limited views. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshop
[58]

Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. 2025. From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976(2025)

work page arXiv 2025
[59]

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. 2025. Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors.arXiv preprint arXiv:2505.24625(2025)

work page arXiv 2025
[60]

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. 2020. Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling. In Proceedings of The European Conference on Computer Vision (ECCV)

2020
[61]

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479(2025)

work page internal anchor Pith review arXiv 2025