VLM3: Vision Language Models Are Native 3D Learners
Pith reviewed 2026-06-29 07:43 UTC · model grok-4.3
The pith
Vision language models learn 3D tasks using only focal length unification, text-based pixel references, and data scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLMs are native 3D learners: focal length unification, text-based pixel reference, and data mixture and scaling suffice for effective 3D learning, which makes model architecture changes, larger models, heavy data augmentations, and complex losses unnecessary.
What carries the argument
The three enabling factors of focal length unification, text-based pixel reference, and data mixture and scaling that allow standard VLMs to master 3D tasks through text-based training.
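A minimal sketch of what focal length unification could look like, assuming it means resizing each image so that its focal length in pixels matches one shared value. The source does not describe the procedure; the `Camera` type, the `unify_focal_length` helper, and the 1000-pixel target are hypothetical names and values chosen for illustration. The sketch relies only on the fact that resizing an image by a factor s also scales its pixel focal length by s.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Camera:
    """Hypothetical container for image size and focal length (in pixels)."""
    width: int
    height: int
    focal_px: float


def unify_focal_length(cam: Camera, target_focal_px: float = 1000.0) -> Tuple[Camera, float]:
    """Return the resized camera and the resize factor that maps
    cam.focal_px onto target_focal_px (an assumed, illustrative target)."""
    if cam.focal_px <= 0 or target_focal_px <= 0:
        raise ValueError("focal lengths must be positive")
    scale = target_focal_px / cam.focal_px
    # Resizing the image by `scale` scales the pixel focal length by `scale`.
    new_width = max(1, round(cam.width * scale))
    new_height = max(1, round(cam.height * scale))
    return Camera(new_width, new_height, target_focal_px), scale


if __name__ == "__main__":
    resized, factor = unify_focal_length(Camera(640, 480, 500.0))
    print(resized, factor)
```

Under this reading, every training image would share one focal length after resizing, so the same pixel extent would correspond to the same viewing angle across datasets.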
If this is right
- Depth estimation accuracy improves substantially from 0.84 to 0.9 on standard benchmarks.
- Pixel correspondence, camera pose estimation, and object-level 3D understanding reach accuracy levels comparable to expert vision models.
- Standard VLM architectures and text-based training suffice without task-specific modifications.
- VLM3 provides a scalable method for diverse 3D tasks using the simplest design.
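The summary does not name the metric behind the 0.84 and 0.9 figures. If they are the δ1 threshold accuracy commonly reported for depth estimation (an assumption, not something the source states), the metric can be sketched as below. The function name `delta1_accuracy` and the sample values are illustrative only.

```python
from typing import Sequence


def delta1_accuracy(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold.

    Pixels with non-positive ground truth are skipped as invalid;
    non-positive predictions on valid pixels count as misses.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth depths")
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)


if __name__ == "__main__":
    # Illustrative depths in meters, not data from the paper.
    print(delta1_accuracy([1.0, 2.0, 3.0, 10.0], [1.1, 2.0, 4.0, 10.5]))
```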
Where Pith is reading between the lines
- This approach may allow 3D capabilities to emerge in general multimodal models without dedicated 3D training pipelines.
- It could simplify integration of 3D understanding into applications like autonomous navigation by reusing existing VLM infrastructure.
- Testing these factors on even larger VLMs or different data distributions might reveal further performance gains.
Load-bearing premise
The observed performance gains on 3D tasks result solely from focal length unification, text-based pixel reference, and data mixture and scaling, isolated from other training variables.
What would settle it
A controlled experiment in which a VLM trained with focal length unification, text-based pixel reference, and data mixture and scaling shows no improvement in depth estimation or other 3D metrics over the baseline.
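One way to set up such a controlled experiment is a full factorial grid over the three factors with everything else held fixed. The sketch below only enumerates run configurations; the factor keys, the `ablation_grid` helper, and the fixed settings are hypothetical placeholders, not details taken from the paper.

```python
from itertools import product
from typing import Dict, List

# The three factors named in the paper's claim, used here as boolean switches.
FACTORS = (
    "focal_length_unification",
    "text_pixel_reference",
    "data_mixture_scaling",
)


def ablation_grid(fixed: Dict[str, object]) -> List[Dict[str, object]]:
    """Enumerate all on/off combinations of FACTORS while copying the
    same fixed settings (model, recipe, seed, ...) into every run."""
    runs = []
    for flags in product((False, True), repeat=len(FACTORS)):
        run = dict(fixed)  # identical controls in every run
        run.update(zip(FACTORS, flags))
        runs.append(run)
    return runs


if __name__ == "__main__":
    # Placeholder controls; real values would come from the study itself.
    for run in ablation_grid({"model": "same-vlm", "seed": 0}):
        print(run)
```

With three binary factors this yields eight runs; comparing the all-off and all-on runs, plus the single-factor runs, would show whether gains can be attributed to the factors rather than to other training variables.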
Original abstract
Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.
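To make "text-based pixel reference" concrete, here is a minimal sketch that assumes a pixel is named in the prompt by normalized integer coordinates written as plain text. The abstract does not specify the coordinate format or the prompt wording; `pixel_to_text`, `depth_prompt`, the 1000-bin normalization, and the question text are all hypothetical.

```python
def pixel_to_text(x: int, y: int, width: int, height: int, bins: int = 1000) -> str:
    """Render pixel (x, y) as normalized integer coordinates in [0, bins - 1]."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel lies outside the image")
    # Integer arithmetic avoids floating-point rounding at bin boundaries.
    nx = x * bins // width
    ny = y * bins // height
    return f"({nx}, {ny})"


def depth_prompt(x: int, y: int, width: int, height: int) -> str:
    """Build an illustrative text-only question about one pixel."""
    ref = pixel_to_text(x, y, width, height)
    return f"How far from the camera, in meters, is the point at pixel {ref}?"


if __name__ == "__main__":
    print(depth_prompt(320, 240, 640, 480))
```

Because the reference is ordinary text, a scheme like this needs no visual markers drawn on the image and no extra input channels, which is consistent with the abstract's emphasis on standard architectures and text-based training.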
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Vision Language Models are native 3D learners. Its central argument, based on an in-depth large-scale study, is that focal length unification, text-based pixel reference, and data mixture/scaling are jointly sufficient for effective 3D learning in standard VLMs. It proposes the simple VLM3 method, which reportedly advances depth estimation accuracy from 0.84 to 0.9 and matches expert vision model performance on pixel correspondence, camera pose estimation, and object-level 3D tasks, while showing that architecture changes, larger models, heavy augmentations, and complex losses (including regression) are unnecessary.
Significance. If the empirical isolation of the three factors holds under controlled conditions, the result would be significant: it would indicate that standard VLMs can handle diverse 3D tasks via minimal, text-based adaptations and data strategies, challenging the necessity of specialized 3D architectures and potentially enabling a simpler, more scalable paradigm.
major comments (1)
- [Abstract] The claim that focal length unification, text-based pixel reference, and data mixture/scaling are all that is needed (and that architecture changes, model scale, augmentations, and complex losses are not necessary) rests on an asserted 'in-depth large scale study,' yet the manuscript supplies no methodology details, dataset descriptions, ablation tables, or controls that hold model size, training recipes, and other variables fixed across conditions. This is load-bearing for the central claim, as performance gains (e.g., the cited depth improvement) cannot be attributed to the three factors without such isolation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive feedback. We address the major comment below and agree that greater clarity on the experimental design is warranted to support the central claims.
Point-by-point responses
-
Referee: The claim that focal length unification, text-based pixel reference, and data mixture/scaling are all that is needed (and that architecture changes, model scale, augmentations, and complex losses are not necessary) rests on an asserted 'in-depth large scale study,' yet the manuscript supplies no methodology details, dataset descriptions, ablation tables, or controls that hold model size, training recipes, and other variables fixed across conditions. This is load-bearing for the central claim, as performance gains (e.g., the cited depth improvement) cannot be attributed to the three factors without such isolation.
Authors: We agree that the abstract is high-level and that explicit isolation of the three factors requires clear documentation of controls. The full manuscript contains an experimental section with dataset descriptions, training details, and ablation studies; however, these may not sufficiently highlight fixed variables (model size, recipes) or directly attribute gains to focal length unification, text-based referencing, and data scaling. We will revise by expanding the methods/experiments section with a dedicated controlled-study subsection, additional ablation tables that explicitly hold other factors fixed, and clearer result attribution. This addresses the load-bearing concern without altering the core findings.
Revision: yes
Circularity Check
No derivation chain; empirical study presents no self-referential reductions
Full rationale
The paper advances an empirical claim based on a large-scale study that focal length unification, text-based pixel reference, and data mixture/scaling suffice for 3D learning in VLMs, rendering architecture changes, model scale, augmentations, and regression losses unnecessary. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central argument is framed as experimental outcomes rather than a mathematical chain that reduces to its inputs by construction, so no circularity of the enumerated kinds is present.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025a. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zha...
-
[2]
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073,
-
[3]
Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models. arXiv preprint arXiv:2406.13642,
-
[4]
Depthlm: Metric depth from vision language models. arXiv preprint arXiv:2509.25413,
Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, and Yangyang Shi. Depthlm: Metric depth from vision language models. arXiv preprint arXiv:2509.25413,
-
[5]
Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158,
-
[6]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,
-
[7]
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279,
-
[8]
SpaceLLaVA github contributors. Spacellava. 2024. https://huggingface.co/remyxai/SpaceLLaVA. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062,
-
[9]
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414,
-
[10]
Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647,
-
[11]
Visual-rft: Visual reinforcement fine-tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044,
-
[12]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,
-
[13]
UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110,
-
[14]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238,
-
[15]
Gim: Learning generalizable image matcher from internet videos. arXiv preprint arXiv:2402.11095,
Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, and Cheng Wang. Gim: Learning generalizable image matcher from internet videos. arXiv preprint arXiv:2402.11095,
-
[16]
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025b. Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProcee...
-
[17]
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493,
-
[18]
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models. arXiv preprint arXiv:2505.17015,
-
[19]
On the generalization capacities of mllms for spatial intelligence. arXiv preprint arXiv:2603.06704,
Gongjie Zhang, Wenhao Li, Quanhao Qian, Jiuniu Wang, Deli Zhao, Shijian Lu, and Ran Xu. On the generalization capacities of mllms for spatial intelligence. arXiv preprint arXiv:2603.06704,
-
[20]
Ufm: A simple path towards unified dense correspondence with flow. arXiv preprint arXiv:2506.09278,
Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, et al. Ufm: A simple path towards unified dense correspondence with flow. arXiv preprint arXiv:2506.09278,
-
[21]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817,
-
[22]
Appendix A Further Implementation Details
Table 4 Hyper-parameters.
Task | Depth Estimation | Object-level 3D | Pixel correspondence | Camera pose estimation
Learning rate | 5.5e-5 | 3.5e-4 | 2e-5 | 5e-5
Batch size | 1344 | 640 | 2816 | 448
Number of samples | 32M (10 pixels each) | 1M | 80M (10 pixels each) | 10M
Table 5 Training data statistics. Depth Estimation Datasets | Number of image...
-
[23]
450K dynamicreplica (Karaev et al., 2023) 1M sail vos3d (Hu et al.,
-
[24]
350K ScanNet++ (Yeshwanth et al., 2023) 1M MPSD (Antequera et al.,
-
[25]
13K RealEstate-10K (Zhou et al., 2018) 880K DL3dv-10k (Ling et al.,
-
[26]
190K Aria Synthetic Environment (Avetisyan et al., 2024) 2M GTA-SFM (Wang and Shen,
-
[27]
850K UnrealStereo4K (Tosi et al., 2021) 270K MVS Synth (Huang et al.,
-
[28]
to randomly sample image pairs with > 25% covisibility. Similar to previous works (Cai et al., 2025; Lin et al., 2025), we hold out 30 scenes from ScanNet++ to ensure the evaluation data come from unseen scenes.