Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Pith reviewed 2026-06-29 07:45 UTC · model grok-4.3
The pith
VLMs acquire reliable 3D spatial reasoning by training on video point correspondences and depth consistency instead of 3D VQA data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training a correspondence head with contrastive loss on ground-truth point matches and depth-consistency supervision across all LLM layers produces internal representations whose point-matching behavior improves from under 5 percent to over 70 percent accuracy; these representations then support substantially stronger performance on downstream 3D spatial benchmarks without any exposure to 3D VQA data during training.
What carries the argument
GASP's correspondence head, a lightweight module applied as deep supervision to every transformer layer and trained with a dual objective of contrastive point-correspondence loss plus depth consistency loss.
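The dual objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes an InfoNCE-style contrastive term and an L1 depth term, and the names `info_nce`, `depth_l1`, `deep_supervision_loss`, `temperature`, and `depth_weight` are all hypothetical. The page does not give the exact loss forms or weights.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(feats_a, feats_b, temperature=0.07):
    """Mean InfoNCE loss where feats_a[i] and feats_b[i] are a ground-truth match.

    Assumed form: every other point in feats_b acts as a negative for query i.
    """
    a = [normalize(f) for f in feats_a]
    b = [normalize(f) for f in feats_b]
    total = 0.0
    for i, query in enumerate(a):
        logits = [dot(query, key) / temperature for key in b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(a)


def depth_l1(pred, target):
    """Mean absolute depth error (an assumed stand-in for the depth consistency term)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def deep_supervision_loss(layers, depth_weight=1.0):
    """Average the dual objective over every supervised layer.

    Each entry of `layers` is a dict holding that layer's head outputs:
    feats_a, feats_b, depth_pred, depth_gt (hypothetical keys).
    """
    per_layer = [
        info_nce(layer["feats_a"], layer["feats_b"])
        + depth_weight * depth_l1(layer["depth_pred"], layer["depth_gt"])
        for layer in layers
    ]
    return sum(per_layer) / len(per_layer)
```

Averaging over layers is one simple way to express "deep supervision"; a weighted sum per layer would be an equally plausible reading of the abstract.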
If this is right
- Peak layer-wise internal point-correspondence accuracy rises from below 5 percent to above 70 percent, and temporal robustness stays above 85 percent.
- Reported gains reach +18.2% on All-Angles Bench and +29.0% on VSI-Bench.
- These gains occur without any 3D VQA training data, reducing the risk of overfitting to dataset-specific question biases.
- The same internal improvements appear across multiple layers rather than only at the final layer.
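To make the "internal correspondence accuracy" figures concrete, here is a hedged sketch of how such a layer-wise metric could be computed. It assumes nearest-neighbour matching by cosine similarity between token features from two views; the paper's actual evaluation protocol is not reproduced on this page, and the names `correspondence_accuracy` and `peak_layer` are hypothetical.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def correspondence_accuracy(src_feats, tgt_feats, gt_index):
    """Fraction of source tokens whose most similar target token is the true match.

    gt_index[i] is the index in tgt_feats that corresponds to src_feats[i].
    """
    hits = 0
    for i, query in enumerate(src_feats):
        sims = [cosine(query, key) for key in tgt_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += best == gt_index[i]
    return hits / len(src_feats)


def peak_layer(per_layer_inputs):
    """Return (layer index, accuracy) for the layer with the highest accuracy.

    per_layer_inputs holds one (src_feats, tgt_feats, gt_index) tuple per layer.
    """
    accs = [correspondence_accuracy(*inputs) for inputs in per_layer_inputs]
    best = max(range(len(accs)), key=accs.__getitem__)
    return best, accs[best]
```

Reporting the peak over layers, as the abstract does, picks the best case; plotting the full per-layer curve would show whether the improvement really spans multiple layers.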
Where Pith is reading between the lines
- The method could be applied to video-only models to improve temporal 3D consistency without task-specific labels.
- If the geometric priors transfer across different video domains, the approach might lower the annotation cost for building spatial reasoning systems.
- Similar deep-supervision heads could be tested on other low-level geometric signals such as surface normals or optical flow.
Load-bearing premise
Ground-truth point correspondences and depth values extracted from large-scale video scenes are accurate enough, and sufficiently free of systematic bias, to produce internal representations that generalize to new spatial tasks.
What would settle it
An ablation that keeps the same architecture and training budget but removes the geometric supervision losses, then measures whether the reported gains on All-Angles Bench and VSI-Bench disappear while internal correspondence accuracy stays low.
Original abstract
Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GASP, a framework to inject 3D spatial priors into VLMs via a correspondence head with deep supervision across transformer layers. It trains on a dual objective (contrastive loss on ground-truth point correspondences from video scenes for view-invariance plus depth consistency) without any 3D VQA data, reports lifting internal correspondence accuracy from <5% to >70% with >85% temporal robustness, and claims resulting gains of +18.2% on All-Angles Bench and +29.0% on VSI-Bench.
Significance. If the central results hold after addressing validation gaps, the work would be significant for demonstrating that geometric priors extracted from video can improve VLM spatial reasoning in a manner that avoids overfitting to 3D VQA datasets and does not require specialized 3D encoders. The diagnostic analysis of baseline VLM correspondence failures provides a useful internal metric, and the separation between training scenes and evaluation benchmarks supports the generalizability argument.
major comments (2)
- [Abstract] Abstract and training objective description: The central claim that improvements arise from fundamental geometric priors (rather than dataset-specific fitting) depends on the accuracy and lack of bias in the ground-truth point correspondences and depths extracted from large-scale video scenes. No independent validation, error metrics, or comparison against synthetic ground truth is provided for the extraction pipeline itself; if extraction errors correlate with scene statistics shared with the evaluation benchmarks, the reported lifts could reflect data artifacts rather than emergent understanding.
- [Results] Results and analysis sections: The reported internal accuracy improvements (baseline <5% to >70% peak layer-wise) and downstream gains (+18.2%, +29.0%) are presented without error bars, ablation studies isolating the contrastive versus depth-consistency terms, dataset sizes for the video training scenes, or statistical tests. These omissions make it impossible to assess whether the dual objective is load-bearing or if the gains are robust.
minor comments (1)
- [Abstract] The abstract references All-Angles Bench and VSI-Bench without defining the tasks or citing their sources; adding these details would improve clarity for readers unfamiliar with the benchmarks.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. The feedback identifies key areas where additional validation and experimental details would strengthen the manuscript. We address each major comment below and will incorporate the suggested improvements in the revised version.
Point-by-point responses
- Referee: [Abstract] Abstract and training objective description: The central claim that improvements arise from fundamental geometric priors (rather than dataset-specific fitting) depends on the accuracy and lack of bias in the ground-truth point correspondences and depths extracted from large-scale video scenes. No independent validation, error metrics, or comparison against synthetic ground truth is provided for the extraction pipeline itself; if extraction errors correlate with scene statistics shared with the evaluation benchmarks, the reported lifts could reflect data artifacts rather than emergent understanding.
  Authors: We agree that validating the ground-truth extraction pipeline is important for supporting the central claim. The correspondences and depths are obtained via standard video-based methods, but we acknowledge the absence of explicit error analysis in the current manuscript. In the revision, we will add a dedicated validation subsection that reports error metrics (e.g., correspondence precision and depth RMSE) by comparing the extracted geometry against synthetic ground-truth scenes generated with a graphics engine. We will also check for correlation between extraction errors and statistics of the evaluation benchmarks to rule out data artifacts.
  Revision: yes
- Referee: [Results] Results and analysis sections: The reported internal accuracy improvements (baseline <5% to >70% peak layer-wise) and downstream gains (+18.2%, +29.0%) are presented without error bars, ablation studies isolating the contrastive versus depth-consistency terms, dataset sizes for the video training scenes, or statistical tests. These omissions make it impossible to assess whether the dual objective is load-bearing or if the gains are robust.
  Authors: We concur that these details are necessary to evaluate robustness. The revised manuscript will include: (1) error bars computed across multiple random seeds for all reported metrics; (2) ablation experiments that isolate the contrastive loss from the depth-consistency term; (3) the exact number of video scenes and frames used for training; and (4) statistical significance tests (paired t-tests with p-values) comparing GASP against baselines. These additions will clarify the contribution of each component of the dual objective.
  Revision: yes
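The error bars and paired t-tests promised in the rebuttal could be computed along the following lines. This is a sketch using only the Python standard library; the function names are hypothetical, and the scores in the usage note are made-up placeholders, not results from the paper.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean across random seeds."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for seed-matched runs.

    Assumes the per-seed differences are not all identical; otherwise the
    sample standard deviation is zero and the statistic is undefined.
    """
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    t = mean_diff / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

For example, with placeholder scores from three seeds, `paired_t_statistic([0.52, 0.50, 0.54], [0.30, 0.31, 0.29])` returns the t statistic together with 2 degrees of freedom. Turning that into a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is one option if SciPy is available.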
Circularity Check
No significant circularity; the derivation relies on external ground-truth geometry and held-out benchmarks.
full rationale
The paper's core derivation trains a correspondence head and depth consistency objective on ground-truth point matches and depths extracted from large-scale video scenes, then measures downstream gains on separate spatial benchmarks (All-Angles Bench, VSI-Bench) that receive no 3D VQA supervision. The reported internal correspondence lift (from below 5% to above 70%) is a direct consequence of the contrastive training signal but is presented only as a diagnostic; the load-bearing claim is generalization to unseen spatial tasks. No equation, self-citation chain, or fitted parameter is shown to redefine the benchmark improvements as quantities already present in the training inputs. The ground-truth extraction pipeline is treated as an external data source rather than a self-referential construct.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large-scale video scenes supply accurate ground-truth point correspondences and depth values suitable for supervision.
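One way to probe this assumption is to score the extracted geometry against a trusted reference, such as the synthetic scenes the rebuttal mentions. The sketch below shows depth RMSE and a thresholded correspondence precision. The function names and the 3-pixel default threshold are illustrative assumptions, not values taken from the paper.

```python
import math


def depth_rmse(pred, ref):
    """Root-mean-square error between extracted and reference depths."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))


def correspondence_precision(pred_points, ref_points, pixel_threshold=3.0):
    """Fraction of extracted matches that land within a pixel radius of the reference.

    Points are (x, y) pixel coordinates; the threshold is an assumed default.
    """
    hits = sum(
        math.dist(p, r) <= pixel_threshold
        for p, r in zip(pred_points, ref_points)
    )
    return hits / len(pred_points)
```

Breaking both metrics down by scene type would also help test the referee's concern that extraction errors might correlate with statistics shared with the evaluation benchmarks.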