Unlocking Dense Metric Depth Estimation in VLMs

· 2026 · cs.CV · arXiv 2605.15876

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

representative citing papers

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

Stream3D-VLM adds autoregressive streaming control, VSFI geometry integration, GAVC compression, and a 1M-pair benchmark to enable real-time 3D VLM performance that beats prior models on 29 online and offline tasks.

citing papers explorer

Showing 1 of 1 citing paper.

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors cs.CV · 2026-06-05 · unverdicted · none · ref 53 · internal anchor
Stream3D-VLM adds autoregressive streaming control, VSFI geometry integration, GAVC compression, and a 1M-pair benchmark to enable real-time 3D VLM performance that beats prior models on 29 online and offline tasks.

Unlocking Dense Metric Depth Estimation in VLMs

fields

years

verdicts

representative citing papers

citing papers explorer