SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-14 22:00 UTC · model grok-4.3
The pith
SpatialStack progressively fuses multi-level geometric features into the language backbone to enable accurate 3D spatial reasoning in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, moving beyond late-stage fusion to progressively align representations across the model hierarchy, thereby capturing both local geometric precision and global contextual semantics.
What carries the argument
The progressive multi-level fusion mechanism that aligns vision, geometry, and language features at each layer of the hierarchy.
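To make the mechanism concrete, here is a minimal sketch of what layer-wise geometry-language fusion could look like in PyTorch. The class names (GeoLevelAdapter, ProgressiveFusionLM), the additive synchronization, and the shallow-to-deep level mapping are illustrative assumptions, not the paper's reported operators or dimensions.

```python
# Minimal sketch of progressive multi-level fusion, under assumed shapes and names.
# Not the paper's implementation: the real model fuses into a pretrained LLM and
# may use a different synchronization operator than the additive one shown here.
import torch
import torch.nn as nn


class GeoLevelAdapter(nn.Module):
    """Projects one geometry-encoder level into the language-model hidden size."""

    def __init__(self, geo_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(geo_dim, lm_dim)
        self.norm = nn.LayerNorm(lm_dim)

    def forward(self, geo_feats: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(geo_feats))


class ProgressiveFusionLM(nn.Module):
    """Toy language backbone that injects a different geometry level before each block.

    fuse_map[i] names the geometry-encoder level fused into block i
    (shallow levels early, deeper levels later in the stack).
    """

    def __init__(self, lm_dim=256, geo_dim=128, fuse_map=(0, 1, 2, 3)):
        super().__init__()
        self.fuse_map = fuse_map
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=4, batch_first=True)
            for _ in fuse_map
        )
        self.adapters = nn.ModuleList(
            GeoLevelAdapter(geo_dim, lm_dim) for _ in fuse_map
        )

    def forward(self, tokens: torch.Tensor, geo_levels) -> torch.Tensor:
        h = tokens
        for block, adapter, lvl in zip(self.blocks, self.adapters, self.fuse_map):
            # Additive synchronization of the matched geometry level with the
            # hidden states entering this block (one of several plausible choices).
            h = h + adapter(geo_levels[lvl])
            h = block(h)
        return h


if __name__ == "__main__":
    B, T = 2, 16
    geo_levels = [torch.randn(B, T, 128) for _ in range(4)]  # per-level geometry features
    tokens = torch.randn(B, T, 256)                          # embedded multimodal tokens
    fused = ProgressiveFusionLM()(tokens, geo_levels)
    print(fused.shape)  # torch.Size([2, 16, 256])
```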
If this is right
- Consistently improves 3D understanding across benchmarks.
- Achieves state-of-the-art results on multiple 3D spatial reasoning tasks.
- Generalizes robustly to diverse spatial reasoning problems.
- Provides an extensible paradigm for integrating geometry into vision-language models.
Where Pith is reading between the lines
- Future models could apply similar stacking to other geometric encoders or modalities for broader spatial tasks.
- This fusion strategy might allow smaller models to achieve high spatial performance by leveraging existing hierarchical signals more effectively.
- Integration with embodied agents could lead to better navigation and manipulation without additional training data.
Load-bearing premise
That the geometry encoder contains rich hierarchical signals that can be fused progressively without causing misalignment or reducing the model's language capabilities.
What would settle it
An experiment showing that a single-level deep fusion model matches or exceeds SpatialStack on all 3D benchmarks while maintaining or improving general VLM performance would falsify the need for multi-level fusion.
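A hedged sketch of how that decisive ablation could be specified, holding everything fixed except the fusion schedule; the field names and level indices are illustrative, not drawn from the paper.

```python
# Illustrative ablation spec: single-level late fusion vs. progressive multi-level
# fusion. Only the fusion schedule differs; training data, backbone, and
# evaluation suite would be held fixed. Names are hypothetical.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FusionAblation:
    name: str
    # Geometry-encoder level injected before each LM block; None means no fusion there.
    fuse_map: Tuple[Optional[int], ...]


late_fusion_baseline = FusionAblation(
    name="single-level deep fusion",
    fuse_map=(None, None, None, 3),   # only the deepest geometry level, fused late
)

multi_level_variant = FusionAblation(
    name="progressive multi-level fusion",
    fuse_map=(0, 1, 2, 3),            # shallow-to-deep levels, fused progressively
)

# If the baseline matches the multi-level variant on every 3D benchmark while
# general VLM scores do not regress, the need for multi-level fusion is falsified.
```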
read the original abstract
Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SpatialStack, a hierarchical fusion framework for vision-language models that progressively aligns and synchronizes multi-level geometric features from vision and geometry encoders with the language backbone. This moves beyond conventional late-stage fusion to capture both local geometric precision and global contextual semantics. The authors present VLM-SpatialStack, which they claim achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks, with ablations demonstrating consistent gains and robust generalization across tasks.
Significance. If the empirical results hold, this work offers a general and extensible paradigm for vision-language-geometry integration that could meaningfully advance 3D spatial reasoning in embodied AI systems. The core idea of preserving hierarchical signals across layers directly targets a documented bottleneck in prior multi-view geometry transformers. The reported ablation consistency and cross-task generalization tests provide a foundation for assessing the framework's robustness.
major comments (2)
- [Abstract] Abstract and Experiments section: The central claim of state-of-the-art performance and consistent ablation gains is asserted without any quantitative metrics, benchmark names, baseline comparisons, error bars, or dataset descriptions, which are load-bearing for evaluating the empirical contribution.
- [Method] Method section (progressive fusion description): The synchronization of multi-level geometric features is presented as avoiding alignment artifacts and capability degradation, but no formal analysis, mathematical characterization of the fusion operation, or targeted ablation isolating artifact introduction is provided to substantiate this assumption.
minor comments (1)
- Clarify notation for feature stacking and synchronization operations to ensure reproducibility of the hierarchical alignment process.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.
read point-by-point responses
- Referee: [Abstract] Abstract and Experiments section: The central claim of state-of-the-art performance and consistent ablation gains is asserted without any quantitative metrics, benchmark names, baseline comparisons, error bars, or dataset descriptions, which are load-bearing for evaluating the empirical contribution.
  Authors: The Experiments section contains full quantitative tables with metrics, benchmark names, baseline comparisons, error bars from repeated runs, and dataset details; the abstract summarizes the key outcome under standard length constraints. To make the central claims more immediately verifiable, we will revise the abstract to include one or two concrete performance figures, the primary benchmark names, and a brief mention of the evaluation protocol. Revision: partial.
- Referee: [Method] Method section (progressive fusion description): The synchronization of multi-level geometric features is presented as avoiding alignment artifacts and capability degradation, but no formal analysis, mathematical characterization of the fusion operation, or targeted ablation isolating artifact introduction is provided to substantiate this assumption.
  Authors: We agree that a more explicit mathematical characterization and a targeted ablation would strengthen the argument. The current Method section defines the layer-wise alignment and synchronization operations via the progressive fusion equations, and the Experiments section already shows that the multi-level approach outperforms late-fusion baselines on multiple tasks. In the revision we will add (i) a compact mathematical formulation of the synchronization operator and (ii) a new ablation that isolates alignment error before and after synchronization. Revision: full.
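As an editorial illustration of what such a compact formulation could look like (assumed notation, not the authors' actual operator), a layer-wise additive synchronization might be written as:

```latex
% Illustrative only: h_\ell are language hidden states at layer \ell,
% g_{\sigma(\ell)} the geometry features from the matched encoder level,
% W_\ell a learned projection, and \mathrm{LN} layer normalization.
\tilde{h}_\ell = h_\ell + \mathrm{LN}\!\left( W_\ell \, g_{\sigma(\ell)} \right),
\qquad \ell = 1, \dots, L .
```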
Circularity Check
No significant circularity detected
full rationale
The paper proposes SpatialStack as an architectural framework for progressive multi-level fusion of vision, geometry, and language features, with claims supported by empirical results on external 3D spatial reasoning benchmarks and ablations. No equations, derivations, or predictions are presented that reduce by construction to fitted parameters or self-definitions. Central claims rest on the described hierarchical alignment strategy and its measured performance gains rather than tautological inputs or load-bearing self-citations. The argument is self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multi-level geometric features from vision and geometry encoders contain rich hierarchical signals not captured by deep-layer-only fusion.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  "SpatialStack stacks and synchronizes multi-level geometric features with the language backbone... shallow layers capture fine spatial details and deeper layers encode global structure"
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  "progressively aligns vision, geometry, and language representations across the model hierarchy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.