Pith · machine review for the scientific record

arXiv:2603.27437 · v3 · submitted 2026-03-28 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D spatial reasoning · vision-language models · hierarchical fusion · geometry integration · multimodal learning · spatial understanding · embodied AI

The pith

SpatialStack progressively fuses multi-level geometric features into the language backbone to enable accurate 3D spatial reasoning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialStack as a general framework for fusing vision, geometry, and language representations hierarchically. Unlike prior methods that only combine deep features, it stacks geometric signals at multiple levels to capture both precise local details and broader context. This leads to a model called VLM-SpatialStack that sets new performance records on 3D spatial reasoning tasks. A sympathetic reader would care because current VLMs often fail at reliable spatial understanding needed for physical AI applications like robotics.

Core claim

SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, moving beyond late-stage fusion to progressively align representations across the model hierarchy, thereby capturing both local geometric precision and global contextual semantics.

What carries the argument

The progressive multi-level fusion mechanism that aligns vision, geometry, and language features at each layer of the hierarchy.
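The mechanism described above (multi-level geometry features, layer-specific projectors, injection into successive decoder layers) can be sketched minimally. Everything below is illustrative: the shapes, the additive injection rule, and the placeholder decoder layer are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_geo, d_lm, n_layers = 16, 64, 128, 4

# Assumed stand-ins: one geometry feature map per hierarchy level, and a
# layer-specific linear projector mapping each into the LM's hidden size.
geo_feats = [rng.normal(size=(n_tokens, d_geo)) for _ in range(n_layers)]
projectors = [rng.normal(size=(d_geo, d_lm)) / np.sqrt(d_geo) for _ in range(n_layers)]

def decoder_layer(h):
    """Placeholder for one LLM decoder layer (here just a bounded nonlinearity)."""
    return np.tanh(h)

h = rng.normal(size=(n_tokens, d_lm))  # token states entering the decoder
for layer in range(n_layers):
    # Progressive fusion: inject the matching geometry level before each
    # decoder layer, instead of fusing a single deep feature once at the input.
    h = h + geo_feats[layer] @ projectors[layer]
    h = decoder_layer(h)

print(h.shape)  # (16, 128)
```

The contrast with late-stage fusion is the loop: a conventional pipeline would perform the `+ geo_feats @ projector` step only once, with only the deepest geometry feature.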

If this is right

  • Consistently improves 3D understanding across benchmarks.
  • Achieves state-of-the-art results on multiple 3D spatial reasoning tasks.
  • Generalizes robustly to diverse spatial reasoning problems.
  • Provides an extensible paradigm for integrating geometry into vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models could apply similar stacking to other geometric encoders or modalities for broader spatial tasks.
  • This fusion strategy might allow smaller models to achieve high spatial performance by leveraging existing hierarchical signals more effectively.
  • Integration with embodied agents could lead to better navigation and manipulation without additional training data.

Load-bearing premise

That the geometry encoder contains rich hierarchical signals that can be fused progressively without causing misalignment or reducing the model's language capabilities.

What would settle it

An experiment showing that a single-level deep fusion model matches or exceeds SpatialStack on all 3D benchmarks while maintaining or improving general VLM performance would falsify the need for multi-level fusion.

Figures

Figures reproduced from arXiv: 2603.27437 by Achuta Kadambi, Bangya Liu, Jian Zhang, Shijie Zhou, Zhiwen Fan.

Figure 1: SpatialStack: Layered Geometry-Language Fusion. Conventional VLMs (a) fuse only a single deep geometry feature with vision tokens, which limits both fine-grained spatial understanding and high-level spatial reasoning. SpatialStack (b) instead stacks multi-level geometry features and injects them hierarchically into successive LLM decoder layers, yielding stronger 3D spatial understanding across benchmarks.
Figure 2: Architecture of SpatialStack. A standard VLM backbone is coupled with a multi-view geometry encoder whose layer-wise features are processed by layer-specific projectors and sequentially injected into the LLM decoder, progressively integrating geometric cues. Explanation of the similarity heatmaps on the left is provided in Sec. 3. This multi-level injection preserves both fine-grained geometric structure…
Figure 3: Examples of spatial tasks at different levels. The left example (Low-Level Task) targets fine-grained geometric perception, such as determining which of two points is closer to the camera. The right example (High-Level Task) requires higher-level spatial reasoning, where the model must estimate the distance between two objects by comparing their closest points in 3D space.
Figure 4: Effect of Geometry Injection Layers on Spatial Tasks. Deeper layers improve high-level tasks, while low-level tasks peak at layer 11 and decline at deeper layers, suggesting a trade-off between fine-grained perception and higher-level reasoning.
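The trade-off reported in the Figure 4 caption is the output of a sweep over the injection layer. The harness below is a toy version of such an ablation: `evaluate()` is a synthetic stand-in for the paper's low- and high-level benchmarks, shaped to match the reported trend, and the 28-layer depth is an assumed typical decoder size, not a figure from the paper.

```python
# Toy sweep over the geometry-injection layer, mirroring the Figure 4 ablation.
def evaluate(injection_layer: int, task: str) -> float:
    """Hypothetical scores shaped like the reported trend: low-level tasks
    peak near a middle layer (11), high-level tasks keep improving with depth."""
    if task == "low":
        return 1.0 - abs(injection_layer - 11) / 28.0
    return injection_layer / 28.0  # "high"

results = {
    layer: {task: evaluate(layer, task) for task in ("low", "high")}
    for layer in range(0, 28, 4)
}
best_low = max(results, key=lambda l: results[l]["low"])
best_high = max(results, key=lambda l: results[l]["high"])
print(best_low, best_high)  # prints "12 24"
```

Even on this toy, the two optima diverge, which is exactly the tension the caption describes: no single injection depth is best for both task families, motivating multi-level injection.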
read the original abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SpatialStack, a hierarchical fusion framework for vision-language models that progressively aligns and synchronizes multi-level geometric features from vision and geometry encoders with the language backbone. This moves beyond conventional late-stage fusion to capture both local geometric precision and global contextual semantics. The authors present VLM-SpatialStack, which they claim achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks, with ablations demonstrating consistent gains and robust generalization across tasks.

Significance. If the empirical results hold, this work offers a general and extensible paradigm for vision-language-geometry integration that could meaningfully advance 3D spatial reasoning in embodied AI systems. The core idea of preserving hierarchical signals across layers directly targets a documented bottleneck in prior multi-view geometry transformers. The reported ablation consistency and cross-task generalization tests provide a foundation for assessing the framework's robustness.

major comments (2)
  1. [Abstract] Abstract and Experiments section: The central claim of state-of-the-art performance and consistent ablation gains is asserted without any quantitative metrics, benchmark names, baseline comparisons, error bars, or dataset descriptions, which are load-bearing for evaluating the empirical contribution.
  2. [Method] Method section (progressive fusion description): The synchronization of multi-level geometric features is presented as avoiding alignment artifacts and capability degradation, but no formal analysis, mathematical characterization of the fusion operation, or targeted ablation isolating artifact introduction is provided to substantiate this assumption.
minor comments (1)
  1. Clarify notation for feature stacking and synchronization operations to ensure reproducibility of the hierarchical alignment process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: The central claim of state-of-the-art performance and consistent ablation gains is asserted without any quantitative metrics, benchmark names, baseline comparisons, error bars, or dataset descriptions, which are load-bearing for evaluating the empirical contribution.

    Authors: The Experiments section contains full quantitative tables with metrics, benchmark names, baseline comparisons, error bars from repeated runs, and dataset details. The abstract follows standard length constraints by summarizing the key outcome. To make the central claims more immediately verifiable, we will revise the abstract to include one or two concrete performance figures, the primary benchmark names, and a brief mention of the evaluation protocol. revision: partial

  2. Referee: [Method] Method section (progressive fusion description): The synchronization of multi-level geometric features is presented as avoiding alignment artifacts and capability degradation, but no formal analysis, mathematical characterization of the fusion operation, or targeted ablation isolating artifact introduction is provided to substantiate this assumption.

    Authors: We agree that a more explicit mathematical characterization and a targeted ablation would strengthen the argument. The current Method section defines the layer-wise alignment and synchronization operations via the progressive fusion equations; the Experiments section already shows that the multi-level approach outperforms late-fusion baselines on multiple tasks. In the revision we will add (i) a compact mathematical formulation of the synchronization operator and (ii) a new ablation that isolates alignment error before and after synchronization. revision: yes
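Neither the abstract nor the rebuttal states the synchronization operator explicitly. A generic form that layer-wise injection schemes of this kind often take (all symbols here are illustrative, not the paper's notation) is:

```latex
h^{(\ell)} = \mathrm{DecoderLayer}_{\ell}\!\left( h^{(\ell-1)} + P_{\ell}\!\left( g^{(\sigma(\ell))} \right) \right),
\qquad \ell = 1, \dots, L
```

where $h^{(\ell)}$ is the decoder hidden state after layer $\ell$, $g^{(k)}$ is the geometry encoder's level-$k$ feature, $P_{\ell}$ is the layer-specific projector, and $\sigma$ maps decoder layers to geometry levels. Under this reading, conventional late-stage fusion is the special case where $P_{\ell}$ is nonzero at a single $\ell$ only.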

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes SpatialStack as an architectural framework for progressive multi-level fusion of vision, geometry, and language features, with claims supported by empirical results on external 3D spatial reasoning benchmarks and ablations. No equations, derivations, or predictions are presented that reduce by construction to fitted parameters or self-definitions. Central claims rest on the described hierarchical alignment strategy and its measured performance gains rather than tautological inputs or load-bearing self-citations. The argument is self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that multi-level geometric features contain rich hierarchical signals discarded by late fusion; no free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: Multi-level geometric features from vision and geometry encoders contain rich hierarchical signals not captured by deep-layer-only fusion.
    Explicitly stated as the core limitation of prior multi-view geometry transformers in the abstract.

pith-pipeline@v0.9.0 · 5523 in / 1208 out tokens · 31603 ms · 2026-05-14T22:00:41.616109+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 4 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  2. [2]

    Qwen2.5-VL Technical Report, 2025

    Shuai Bai et al. Qwen2.5-VL Technical Report, 2025. 3, 5, 7, 8, 12

  3. [3]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 13

  4. [4]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 13154–13164, 2023. 8

  5. [5]

    Spatialbot: Pre- cise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9490–9498. IEEE, 2025. 17

  6. [6]

    Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties. InCVPR, pages 14455–14465, 2024. 17

  7. [7]

    Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024. 7

  8. [8]

    Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37:135062–135093, 2024

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37:135062–135093, 2024. 17

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 7

  10. [10]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 13

  11. [11]

    InstructBLIP: Towards general-purpose vision- language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision- language models with instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

  12. [12]

    Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

    Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

  13. [13]

    Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. 1, 2, 4, 5, 7, 13, 14

  14. [14]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 7

  15. [15]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Com- puter Vision, pages 148–166. Springer, 2024. 3, 4, 6, 7

  16. [16]

    3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,

  17. [17]

    An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2, 3

  18. [18]

    Think- ing in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world,

    Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yun- long Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, and Zhi Wang. Think- ing in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world,

  19. [19]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8

  20. [20]

    Conceptfusion: Open-set multimodal 3d mapping.arXiv preprint arXiv:2302.07241, 2023

    Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Alaa Maalouf, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, et al. Conceptfusion: Open-set multimodal 3d mapping.arXiv preprint arXiv:2302.07241, 2023. 4

  21. [21]

    From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023

    Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin’e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, and Hongkai Xiong. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023. 3

  22. [22]

    What’s” up” with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s” up” with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

  23. [23]

    Ground- ing image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 3

  24. [24]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv, 2024. 7, 8

  25. [25]

    BLIP: Bootstrapping Language-Image Pre-training for Uni- fied Vision-Language Understanding and Generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Uni- fied Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Ma- chine Learning (ICML), pages 12763–12779. PMLR, 2022. 3

  26. [26]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

  27. [27]

    Mini-gemini: Mining the potential of multi-modality vision language models.arXiv preprint arXiv:2403.18814,

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models.arXiv preprint arXiv:2403.18814,

  28. [28]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025. 2

  29. [29]

    Vila: On pre-training for vi- sual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 26689–26699, 2024. 7

  30. [30]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 8

  31. [31]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

  32. [32]

    Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vi- sion, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vi- sion, pages 216–233. Springer, 2024. 8

  33. [33]

    Tempcom- pass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024. 8

  34. [34]

    Ssr: Enhancing depth perception in vision-language mod- els via rationale-guided spatial reasoning.arXiv preprint arXiv:2505.12448, 2025

    Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, and Donglin Wang. Ssr: Enhancing depth perception in vision-language mod- els via rationale-guided spatial reasoning.arXiv preprint arXiv:2505.12448, 2025. 4

  35. [35]

    Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

    Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025. 1, 3

  36. [36]

    DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

    Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zux- uan Wu, Jianfeng Gao, and Yu-Gang Jiang. DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 3, 5

  37. [37]

    Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025

    Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025. 2, 3

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763. PMLR...

  39. [39]

    Vi- sion transformers for dense prediction

    Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 2, 6

  40. [40]

    Sat: Dynamic spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kemb- havi, Bryan A Plummer, Ranjay Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024. 8

  41. [41]

    Structure- from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4104–4113, 2016. 2

  42. [42]

    Tulip: Towards unified language-image pre- training.arXiv preprint arXiv:2503.15485, 2025

    Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, and David M Chan. Tulip: Towards unified language-image pre- training.arXiv preprint arXiv:2503.15485, 2025. 2, 3

  43. [43]

    Qwen3.5: Accelerating productivity with na- tive multimodal agents, 2026

    Qwen Team. Qwen3.5: Accelerating productivity with na- tive multimodal agents, 2026. 5, 7, 8, 12

  44. [44]

    Splattalk: 3d vqa with gaussian splatting.Proceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), 2025

    Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. Splattalk: 3d vqa with gaussian splatting.Proceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), 2025. 4

  45. [45]

    Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024. 3, 6, 7, 8

  46. [46]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3, 4, 6, 12

  47. [47]

    Continuous 3d per- ception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d per- ception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025. 2, 3 10

  48. [48]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697– 20709, 2024. 2, 3

  49. [49]

    Dynamicverse: A physically- aware multimodal framework for 4d world modeling.arXiv preprint arXiv:2512.03000, 2025

    Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yun- long Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, et al. Dynamicverse: A physically- aware multimodal framework for 4d world modeling.arXiv preprint arXiv:2512.03000, 2025. 3

  50. [50]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information process- ing systems, 2025

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information process- ing systems, 2025. 1, 2, 4, 5, 7

  51. [51]

    Thinking in space: How mul- timodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 2, 3, 6, 7, 16

  52. [52]

    Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wen- qian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025. 1, 3

  53. [53]

    Cambrian-s: Towards spatial supersens- ing in video.arXiv preprint arXiv:2511.04670, 2025

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zi- hao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersens- ing in video.arXiv preprint arXiv:2511.04670, 2025. 1, 3, 5, 7, 8, 13, 14

  54. [54]

    Scannet++: A high-fidelity dataset of 3d in- door scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 13

  [55]

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. 12

  [56]

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976, 2025. 3, 4, 6, 7, 8, 13

  [57]

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv, 2024. 7

  [58]

    Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander G Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Lingu...

  [59]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  [60]

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors. Advances in Neural Information Processing Systems, 2025. 1, 2, 4, 5, 6, 7, 8, 13, 14, 15, 16

  [61]

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025. 2, 3

  [62]

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.

  [63]

    Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 4

  [64]

    Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, et al. Feature4x: Bridging any monocular video to 4d agentic ai with versatile gaussian feature fields. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14179–14190, 2025. 4

  [65]

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8600–8612, 2025. 2, 3

  [66]

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025. 2, 3

  [67]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 3

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Supplementary Material

In this supplementary material, we provide comprehensi...