World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
Pith reviewed 2026-05-07 09:15 UTC · model grok-4.3
The pith
Distilling future views from a view-consistent world model into a VLM lets it internalize dynamic spatial reasoning without test-time generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
World2VLM distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, a view-consistent world model synthesizes geometrically aligned future views. Structured supervision is derived for forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. The VLM is then post-trained with a two-stage recipe on the resulting compact dataset. This produces consistent gains over the base model on SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube while outperforming test-time world-model-coupled approaches and removing their inference-time cost.
What carries the argument
The view-consistent world model that synthesizes future views and supplies forward and inverse supervision signals for the two-stage post-training of the VLM.
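The derivation step can be pictured as a small data-generation loop. The sketch below is a hypothetical reconstruction: every name in it (CameraMove, render_future_view, derive_pairs) is an illustrative stand-in, not the paper's actual interface.

```python
from dataclasses import dataclass

# Hedged sketch of the supervision-derivation step. All names here are
# hypothetical stand-ins for the pipeline described in the paper.

@dataclass
class CameraMove:
    direction: str   # e.g. "forward", "left"
    meters: float

def render_future_view(initial_view: str, move: CameraMove) -> str:
    # Placeholder for the view-consistent world model: in the real
    # pipeline this would synthesize a geometrically aligned image,
    # not a string.
    return f"{initial_view} after moving {move.direction} {move.meters}m"

def derive_pairs(initial_view: str, move: CameraMove):
    """Forward (action-to-outcome) and inverse (outcome-to-action) QA pairs."""
    future_view = render_future_view(initial_view, move)
    forward = {
        "question": f"From this view, what do you see after moving "
                    f"{move.direction} {move.meters}m?",
        "answer": future_view,
    }
    inverse = {
        "question": "What camera motion maps the first view to the second?",
        "answer": f"{move.direction} {move.meters}m",
    }
    return [forward, inverse]
```

Each sampled trajectory thus yields both supervision directions from a single world-model rollout, which is what makes the resulting dataset compact.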
If this is right
- VLMs show measurable gains on SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube after the distillation process.
- Performance exceeds that of methods that attach a world model only at inference time.
- The computational expense of generating world-model outputs during deployment is eliminated.
- Spatial imagination becomes an internalized capability rather than an external add-on.
Where Pith is reading between the lines
- The same distillation pattern could be applied to teach VLMs other forms of predictive reasoning that currently require separate generative modules.
- Lower inference cost could make these models more practical for real-time egocentric tasks such as navigation or manipulation.
- The two-stage recipe might be reused with different world models or motion parameterizations to target new scene dynamics.
Load-bearing premise
The future views produced by the world model are geometrically accurate enough for the derived supervision to improve the VLM without introducing distribution shift or label noise.
What would settle it
If the post-trained VLM shows no gains or worse results on the spatial benchmarks when the world model's synthesized views are replaced by lower-fidelity or randomly altered images, the claim that accurate geometric supervision is what drives the improvement would be undermined.
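A minimal sketch of the corruption half of that control experiment, assuming views are arrays of pixel intensities (degrade_view is a hypothetical helper, not from the paper):

```python
import random

# Hypothetical sketch of the proposed control: degrade the world
# model's synthesized views before deriving supervision, then compare
# post-training gains against the clean-view condition. Views are
# modeled as flat lists of pixel intensities in [0, 255].

def degrade_view(view, noise_level=50.0, seed=0):
    """Return a lower-fidelity copy of a synthesized view by adding
    bounded, seeded random noise to every pixel."""
    rng = random.Random(seed)
    return [min(255.0, max(0.0, p + rng.uniform(-noise_level, noise_level)))
            for p in view]
```

If training on degraded views closes most of the gap to the clean condition, the gains would owe more to the supervision format than to geometric accuracy.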
read the original abstract
Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces World2VLM, a training framework that distills dynamic spatial reasoning capabilities from a generative world model into a vision-language model. Given an initial observation and parameterized camera trajectory, a view-consistent world model synthesizes geometrically aligned future views; forward (action-to-outcome) and inverse (outcome-to-action) supervision signals are then derived from these views. The VLM undergoes two-stage post-training on the resulting compact dataset. The paper claims consistent performance gains over the base VLM on SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube, while also outperforming test-time world-model-coupled baselines without incurring their inference-time generation costs.
Significance. If the empirical claims are substantiated and the synthesized supervision proves high-fidelity, the work would offer a meaningful advance by showing that world models can serve as efficient training-time teachers rather than runtime components. This could reduce inference overhead for spatially aware VLMs and provide a scalable path to internalizing motion-conditioned scene evolution, with potential impact on embodied AI and dynamic visual reasoning tasks.
major comments (2)
- [Abstract] The central claim of 'consistent improvements' and of superiority over test-time methods is asserted without quantitative deltas, baseline details, ablation results, or error analysis, preventing verification of the reported gains on SAT-Real, VSI-Bench, and the other benchmarks.
- [Method] World-model synthesis and supervision derivation: no quantitative validation of synthesis fidelity (e.g., PSNR, depth error, or view-consistency scores on held-out real trajectories) is reported, nor is there an ablation replacing synthesized views with ground-truth frames. This is load-bearing because residual geometric or appearance errors would directly inject noise into the forward and inverse supervision used for two-stage post-training, undermining the transfer assumption.
minor comments (1)
- [Training recipe] The description of the two-stage post-training recipe would benefit from explicit equations or pseudocode showing how forward and inverse losses are combined and scheduled.
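For concreteness, one plausible shape such pseudocode could take (the stage split and the weight alpha are illustrative assumptions, not the paper's published recipe):

```python
# Illustrative sketch only: the stage structure, weighting, and loss
# names are assumptions; the paper does not publish this recipe in
# the material reviewed here.

def combined_loss(loss_forward: float, loss_inverse: float,
                  stage: int, alpha: float = 0.5) -> float:
    """Stage 1: forward-only warm-up on action-to-outcome supervision;
    stage 2: weighted mix of forward and inverse objectives."""
    if stage == 1:
        return loss_forward
    return alpha * loss_forward + (1.0 - alpha) * loss_inverse
```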
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] The central claim of 'consistent improvements' and of superiority over test-time methods is asserted without quantitative deltas, baseline details, ablation results, or error analysis, preventing verification of the reported gains on SAT-Real, VSI-Bench, and the other benchmarks.
Authors: We agree that the abstract is high-level and could reference the quantitative evidence more explicitly. The manuscript reports detailed performance metrics, deltas relative to the base VLM and the test-time baselines, ablation studies, and error analyses in Section 4 (Experiments) and the associated tables for SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. To improve verifiability, we will revise the abstract to incorporate key quantitative highlights while respecting its length constraints. revision: yes
-
Referee: [Method] World-model synthesis and supervision derivation: no quantitative validation of synthesis fidelity (e.g., PSNR, depth error, or view-consistency scores on held-out real trajectories) is reported, nor is there an ablation replacing synthesized views with ground-truth frames. This is load-bearing because residual geometric or appearance errors would directly inject noise into the forward and inverse supervision used for two-stage post-training, undermining the transfer assumption.
Authors: This is a substantive and valid concern, as the fidelity of the synthesized views underpins the quality of the forward and inverse supervision signals. Our pipeline relies on a view-consistent generative world model, with qualitative demonstrations of geometric alignment provided in the main paper and supplementary material. However, we did not include quantitative fidelity metrics (such as PSNR or depth error on held-out real trajectories) or an ablation substituting ground-truth frames for synthesized views. We will add these evaluations and the requested ablation in the revised manuscript to directly address the transfer assumption. revision: yes
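The requested fidelity check is standard; for instance, PSNR over paired real and synthesized frames can be computed as follows (a minimal sketch over flat pixel lists, not the authors' evaluation code):

```python
import math

def psnr(reference, synthesized, max_value=255.0):
    """Peak signal-to-noise ratio between a held-out real frame and a
    world-model view, both given as flat pixel-intensity lists.
    Higher is better; identical frames give infinity."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Reporting this alongside depth error on held-out trajectories would directly quantify how much noise the supervision inherits from synthesis.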
Circularity Check
No circularity: empirical pipeline with external world model and benchmark evaluation
full rationale
The paper presents an empirical training framework that uses an external generative world model to synthesize views and derive supervision signals for post-training a VLM, followed by evaluation on independent benchmarks (SAT-Real, VSI-Bench, etc.). No derivation chain, equation, or claimed prediction reduces to its own inputs by construction, and no central result depends on a self-citation loop or on a fitted parameter renamed as an output. The method rests on external data generation and standard test sets, with no load-bearing self-referential steps.