Pith · machine review for the scientific record

arXiv:2603.27437 · v3 · submitted 2026-03-28 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D spatial reasoning · vision-language models · hierarchical fusion · geometry integration · multimodal learning · spatial understanding · embodied AI

The pith

SpatialStack progressively fuses multi-level geometric features into the language backbone to enable accurate 3D spatial reasoning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialStack as a general framework for fusing vision, geometry, and language representations hierarchically. Unlike prior methods that only combine deep features, it stacks geometric signals at multiple levels to capture both precise local details and broader context. This leads to a model called VLM-SpatialStack that sets new performance records on 3D spatial reasoning tasks. A sympathetic reader would care because current VLMs often fail at reliable spatial understanding needed for physical AI applications like robotics.

Core claim

SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, moving beyond late-stage fusion to progressively align representations across the model hierarchy, thereby capturing both local geometric precision and global contextual semantics.

What carries the argument

The progressive multi-level fusion mechanism that aligns vision, geometry, and language features at each layer of the hierarchy.
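The mechanism described above (multi-level geometry features, layer-specific projectors, injection into successive decoder layers) can be sketched minimally. Everything below is illustrative: the shapes, the additive injection rule, and the placeholder decoder layer are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_geo, d_lm, n_layers = 16, 64, 128, 4

# Assumed stand-ins: one geometry feature map per hierarchy level, and a
# layer-specific linear projector mapping each into the LM's hidden size.
geo_feats = [rng.normal(size=(n_tokens, d_geo)) for _ in range(n_layers)]
projectors = [rng.normal(size=(d_geo, d_lm)) / np.sqrt(d_geo) for _ in range(n_layers)]

def decoder_layer(h):
    """Placeholder for one LLM decoder layer (here just a bounded nonlinearity)."""
    return np.tanh(h)

h = rng.normal(size=(n_tokens, d_lm))  # token states entering the decoder
for layer in range(n_layers):
    # Progressive fusion: inject the matching geometry level before each
    # decoder layer, instead of fusing a single deep feature once at the input.
    h = h + geo_feats[layer] @ projectors[layer]
    h = decoder_layer(h)

print(h.shape)  # (16, 128)
```

The contrast with late-stage fusion is the loop: a conventional pipeline would perform the `+ geo_feats @ projector` step only once, with only the deepest geometry feature.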

If this is right

  • Consistently improves 3D understanding across benchmarks.
  • Achieves state-of-the-art results on multiple 3D spatial reasoning tasks.
  • Generalizes robustly to diverse spatial reasoning problems.
  • Provides an extensible paradigm for integrating geometry into vision-language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models could apply similar stacking to other geometric encoders or modalities for broader spatial tasks.
  • This fusion strategy might allow smaller models to achieve high spatial performance by leveraging existing hierarchical signals more effectively.
  • Integration with embodied agents could lead to better navigation and manipulation without additional training data.

Load-bearing premise

That the geometry encoder contains rich hierarchical signals that can be fused progressively without causing misalignment or reducing the model's language capabilities.

What would settle it

An experiment showing that a single-level deep fusion model matches or exceeds SpatialStack on all 3D benchmarks while maintaining or improving general VLM performance would falsify the need for multi-level fusion.

Figures

Figures reproduced from arXiv: 2603.27437 by Achuta Kadambi, Bangya Liu, Jian Zhang, Shijie Zhou, Zhiwen Fan.

Figure 1: SpatialStack: Layered Geometry-Language Fusion. Conventional VLMs (a) fuse only a single deep geometry feature with vision tokens, which limits both fine-grained spatial understanding and high-level spatial reasoning. SpatialStack (b) instead stacks multi-level geometry features and injects them hierarchically into successive LLM decoder layers, yielding stronger 3D spatial understanding across benchmarks.
Figure 2: Architecture of SpatialStack. A standard VLM backbone is coupled with a multi-view geometry encoder whose layer-wise features are processed by layer-specific projectors and sequentially injected into the LLM decoder, progressively integrating geometric cues. Explanation of the similarity heatmaps on the left is provided in Sec. 3. This multi-level injection preserves both fine-grained geometric structure…
Figure 3: Examples of spatial tasks at different levels. The left example (Low-Level Task) targets fine-grained geometric perception, such as determining which of two points is closer to the camera. The right example (High-Level Task) requires higher-level spatial reasoning, where the model must estimate the distance between two objects by comparing their closest points in 3D space.
Figure 4: Effect of Geometry Injection Layers on Spatial Tasks. Deeper layers improve high-level tasks, while low-level tasks peak at layer 11 and decline at deeper layers, suggesting a trade-off between fine-grained perception and higher-level reasoning.
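The trade-off reported in the Figure 4 caption is the output of a sweep over the injection layer. The harness below is a toy version of such an ablation: `evaluate()` is a synthetic stand-in for the paper's low- and high-level benchmarks, shaped to match the reported trend, and the 28-layer depth is an assumed typical decoder size, not a figure from the paper.

```python
# Toy sweep over the geometry-injection layer, mirroring the Figure 4 ablation.
def evaluate(injection_layer: int, task: str) -> float:
    """Hypothetical scores shaped like the reported trend: low-level tasks
    peak near a middle layer (11), high-level tasks keep improving with depth."""
    if task == "low":
        return 1.0 - abs(injection_layer - 11) / 28.0
    return injection_layer / 28.0  # "high"

results = {
    layer: {task: evaluate(layer, task) for task in ("low", "high")}
    for layer in range(0, 28, 4)
}
best_low = max(results, key=lambda l: results[l]["low"])
best_high = max(results, key=lambda l: results[l]["high"])
print(best_low, best_high)  # prints "12 24"
```

Even on this toy, the two optima diverge, which is exactly the tension the caption describes: no single injection depth is best for both task families, motivating multi-level injection.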
read the original abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SpatialStack, a hierarchical fusion framework for vision-language models that progressively aligns and synchronizes multi-level geometric features from vision and geometry encoders with the language backbone. This moves beyond conventional late-stage fusion to capture both local geometric precision and global contextual semantics. The authors present VLM-SpatialStack, which they claim achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks, with ablations demonstrating consistent gains and robust generalization across tasks.

Significance. If the empirical results hold, this work offers a general and extensible paradigm for vision-language-geometry integration that could meaningfully advance 3D spatial reasoning in embodied AI systems. The core idea of preserving hierarchical signals across layers directly targets a documented bottleneck in prior multi-view geometry transformers. The reported ablation consistency and cross-task generalization tests provide a foundation for assessing the framework's robustness.

major comments (2)
  1. [Abstract] Abstract and Experiments section: The central claim of state-of-the-art performance and consistent ablation gains is asserted without any quantitative metrics, benchmark names, baseline comparisons, error bars, or dataset descriptions, which are load-bearing for evaluating the empirical contribution.
  2. [Method] Method section (progressive fusion description): The synchronization of multi-level geometric features is presented as avoiding alignment artifacts and capability degradation, but no formal analysis, mathematical characterization of the fusion operation, or targeted ablation isolating artifact introduction is provided to substantiate this assumption.
minor comments (1)
  1. Clarify notation for feature stacking and synchronization operations to ensure reproducibility of the hierarchical alignment process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: The central claim of state-of-the-art performance and consistent ablation gains is asserted without any quantitative metrics, benchmark names, baseline comparisons, error bars, or dataset descriptions, which are load-bearing for evaluating the empirical contribution.

    Authors: The Experiments section contains full quantitative tables with metrics, benchmark names, baseline comparisons, error bars from repeated runs, and dataset details. The abstract follows standard length constraints by summarizing the key outcome. To make the central claims more immediately verifiable, we will revise the abstract to include one or two concrete performance figures, the primary benchmark names, and a brief mention of the evaluation protocol. revision: partial

  2. Referee: [Method] Method section (progressive fusion description): The synchronization of multi-level geometric features is presented as avoiding alignment artifacts and capability degradation, but no formal analysis, mathematical characterization of the fusion operation, or targeted ablation isolating artifact introduction is provided to substantiate this assumption.

    Authors: We agree that a more explicit mathematical characterization and a targeted ablation would strengthen the argument. The current Method section defines the layer-wise alignment and synchronization operations via the progressive fusion equations; the Experiments section already shows that the multi-level approach outperforms late-fusion baselines on multiple tasks. In the revision we will add (i) a compact mathematical formulation of the synchronization operator and (ii) a new ablation that isolates alignment error before and after synchronization. revision: yes
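Neither the abstract nor the rebuttal states the synchronization operator explicitly. A generic form that layer-wise injection schemes of this kind often take (all symbols here are illustrative, not the paper's notation) is:

```latex
h^{(\ell)} = \mathrm{DecoderLayer}_{\ell}\!\left( h^{(\ell-1)} + P_{\ell}\!\left( g^{(\sigma(\ell))} \right) \right),
\qquad \ell = 1, \dots, L
```

where $h^{(\ell)}$ is the decoder hidden state after layer $\ell$, $g^{(k)}$ is the geometry encoder's level-$k$ feature, $P_{\ell}$ is the layer-specific projector, and $\sigma$ maps decoder layers to geometry levels. Under this reading, conventional late-stage fusion is the special case where $P_{\ell}$ is nonzero at a single $\ell$ only.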

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes SpatialStack as an architectural framework for progressive multi-level fusion of vision, geometry, and language features, with claims supported by empirical results on external 3D spatial reasoning benchmarks and ablations. No equations, derivations, or predictions are presented that reduce by construction to fitted parameters or self-definitions. Central claims rest on the described hierarchical alignment strategy and its measured performance gains rather than tautological inputs or load-bearing self-citations. The argument is self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that multi-level geometric features contain rich hierarchical signals discarded by late fusion; no free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption: Multi-level geometric features from vision and geometry encoders contain rich hierarchical signals not captured by deep-layer-only fusion.
    Explicitly stated as the core limitation of prior multi-view geometry transformers in the abstract.

pith-pipeline@v0.9.0 · 5523 in / 1208 out tokens · 31603 ms · 2026-05-14T22:00:41.616109+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 4 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  2. [2]

    Qwen2.5-VL Technical Report, 2025

    Shuai Bai et al. Qwen2.5-VL Technical Report, 2025. 3, 5, 7, 8, 12

  3. [3]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 13

  4. [4]

    Omni3d: A large benchmark and model for 3d object detection in the wild

    Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, and Georgia Gkioxari. Omni3d: A large benchmark and model for 3d object detection in the wild. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 13154–13164, 2023. 8

  5. [5]

    Spatialbot: Pre- cise spatial understanding with vision language models

    Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Pre- cise spatial understanding with vision language models. In 2025 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9490–9498. IEEE, 2025. 17

  6. [6]

    Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endow- ing vision-language models with spatial reasoning capabili- ties. InCVPR, pages 14455–14465, 2024. 17

  7. [7]

    Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024. 7

  8. [8]

    Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37:135062–135093, 2024

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Rui- han Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatial- rgpt: Grounded spatial reasoning in vision-language mod- els.Advances in Neural Information Processing Systems, 37:135062–135093, 2024. 17

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 7

  10. [10]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 13

  11. [11]

    InstructBLIP: Towards general-purpose vision- language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision- language models with instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

  12. [12]

    Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

    Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

  13. [13]

    Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. 1, 2, 4, 5, 7, 13, 14

  14. [14]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 7

  15. [15]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Com- puter Vision, pages 148–166. Springer, 2024. 3, 4, 6, 7

  16. [16]

    3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,

  17. [17]

    An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2, 3

  18. [18]

    Think- ing in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world,

    Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yun- long Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, and Zhi Wang. Think- ing in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world,

  19. [19]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7, 8

  20. [20]

    Conceptfusion: Open-set multimodal 3d mapping.arXiv preprint arXiv:2302.07241, 2023

    Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Alaa Maalouf, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, et al. Conceptfusion: Open-set multimodal 3d mapping.arXiv preprint arXiv:2302.07241, 2023. 4

  21. [21]

    From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023

    Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin’e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, and Hongkai Xiong. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023. 3

  22. [22]

    What’s” up” with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s” up” with vision-language models? investigating their strug- gle with spatial reasoning.arXiv preprint arXiv:2310.19785,

  23. [23]

    Ground- ing image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 3

  24. [24]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv, 2024. 7, 8

  25. [25]

    BLIP: Bootstrapping Language-Image Pre-training for Uni- fied Vision-Language Understanding and Generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Uni- fied Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Ma- chine Learning (ICML), pages 12763–12779. PMLR, 2022. 3

  26. [26]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

  27. [27]

    Mini-gemini: Mining the potential of multi-modality vision language models.arXiv preprint arXiv:2403.18814,

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models.arXiv preprint arXiv:2403.18814,

  28. [28]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding?arXiv preprint arXiv:2503.23765, 2025. 2

  29. [29]

    Vila: On pre-training for vi- sual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for vi- sual language models. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 26689–26699, 2024. 7

  30. [30]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 8

  31. [31]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

  32. [32]

    Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vi- sion, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vi- sion, pages 216–233. Springer, 2024. 8

  33. [33]

    Tempcom- pass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024. 8

  34. [34]

    Ssr: Enhancing depth perception in vision-language mod- els via rationale-guided spatial reasoning.arXiv preprint arXiv:2505.12448, 2025

    Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, and Donglin Wang. Ssr: Enhancing depth perception in vision-language mod- els via rationale-guided spatial reasoning.arXiv preprint arXiv:2505.12448, 2025. 4

  35. [35]

    Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

    Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025. 1, 3

  36. [36]

    DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

    Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zux- uan Wu, Jianfeng Gao, and Yu-Gang Jiang. DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 3, 5

  37. [37]

    Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025

    Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Be- yond semantics: Rediscovering spatial awareness in vision- language models.arXiv preprint arXiv:2503.17349, 2025. 2, 3

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning (ICML), pages 8748–8763. PMLR...

  39. [39]

    Vi- sion transformers for dense prediction

    Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 2, 6

  40. [40]

    Sat: Dynamic spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kemb- havi, Bryan A Plummer, Ranjay Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024. 8

  41. [41]

    Structure- from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4104–4113, 2016. 2

  42. [42]

    Tulip: Towards unified language-image pre- training.arXiv preprint arXiv:2503.15485, 2025

    Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, and David M Chan. Tulip: Towards unified language-image pre- training.arXiv preprint arXiv:2503.15485, 2025. 2, 3

  43. [43]

    Qwen3.5: Accelerating productivity with na- tive multimodal agents, 2026

    Qwen Team. Qwen3.5: Accelerating productivity with na- tive multimodal agents, 2026. 5, 7, 8, 12

  44. [44]

    Splattalk: 3d vqa with gaussian splatting.Proceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), 2025

    Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. Splattalk: 3d vqa with gaussian splatting.Proceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), 2025. 4

  45. [45]

    Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric explo- ration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024. 3, 6, 7, 8

  46. [46]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3, 4, 6, 12

  47. [47]

    Continuous 3d per- ception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d per- ception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025. 2, 3 10

  48. [48]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697– 20709, 2024. 2, 3

  49. [49]

    Dynamicverse: A physically- aware multimodal framework for 4d world modeling.arXiv preprint arXiv:2512.03000, 2025

    Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yun- long Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, et al. Dynamicverse: A physically- aware multimodal framework for 4d world modeling.arXiv preprint arXiv:2512.03000, 2025. 3

  50. [50]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information process- ing systems, 2025

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.Advances in neural information process- ing systems, 2025. 1, 2, 4, 5, 7

  51. [51]

    Thinking in space: How mul- timodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 2, 3, 6, 7, 16

  52. [52]

    Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wen- qian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025. 1, 3

  53. [53]

    Cambrian-s: Towards spatial supersens- ing in video.arXiv preprint arXiv:2511.04670, 2025

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zi- hao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-s: Towards spatial supersens- ing in video.arXiv preprint arXiv:2511.04670, 2025. 1, 3, 5, 7, 8, 13, 14

  54. [54]

    Scannet++: A high-fidelity dataset of 3d in- door scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 13

  [55]

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. 12

  [56]

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976, 2025. 3, 4, 6, 7, 8, 13

  [57]

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv, 2024. 7

  [58]

    Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander G Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Lingu...

  [59]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  [60]

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors. Advances in Neural Information Processing Systems, 2025. 1, 2, 4, 5, 6, 7, 8, 13, 14, 15, 16

  [61]

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025. 2, 3

  [62]

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.

  [63]

    Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 4

  [64]

    Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, et al. Feature4x: Bridging any monocular video to 4d agentic ai with versatile gaussian feature fields. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14179–14190, 2025. 4

  [65]

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8600–8612, 2025. 2, 3

  [66]

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025. 2, 3

  [67]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 3

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Supplementary Material

In this supplementary material, we provide comprehensi...