Recognition: 2 theorem links · Lean Theorem
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3
The pith
Depth Pro produces sharp, metric-scale depth maps from single images in 0.3 seconds without any camera metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Depth Pro synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. These characteristics are enabled by an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.
What carries the argument
An efficient multi-scale vision transformer for dense prediction paired with a mixed real-synthetic training protocol that jointly optimizes metric scale and boundary fidelity.
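For intuition, a minimal sketch of the multi-scale pattern follows: a shared ViT-style encoder is run on an image pyramid and the per-scale features are fused into one dense prediction. The two-scale pyramid, layer sizes, and fusion scheme here are illustrative assumptions, not Depth Pro's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleViTSketch(nn.Module):
    """Illustrative multi-scale dense predictor: a shared patch encoder is
    applied to an image pyramid, per-scale features are fused at the finest
    token grid, and a 1x1 head decodes a dense one-channel map (e.g.
    inverse depth). All sizes are toy values, not the paper's."""

    def __init__(self, patch=16, dim=64):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def encode(self, x):
        tokens = self.patch_embed(x)                 # (B, dim, H/p, W/p)
        b, c, h, w = tokens.shape
        seq = self.encoder(tokens.flatten(2).transpose(1, 2))
        return seq.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        fine = self.encode(x)
        coarse = self.encode(F.interpolate(x, scale_factor=0.5,
                                           mode="bilinear", align_corners=False))
        coarse = F.interpolate(coarse, size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        out = self.head(self.fuse(torch.cat([fine, coarse], dim=1)))
        return F.interpolate(out, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)

depth = MultiScaleViTSketch()(torch.randn(1, 3, 256, 256))
print(depth.shape)  # torch.Size([1, 1, 256, 256])
```

The appeal of this pattern is that each transformer pass runs on a modest fixed token grid, so fine detail and global context come from cheap separate passes rather than one prohibitively large attention window.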
If this is right
- The model generates 2.25-megapixel depth maps in 0.3 seconds on a standard GPU.
- Depth estimates remain metric and absolute without camera intrinsics or other metadata.
- Boundary accuracy improves measurably through the dedicated evaluation metrics introduced.
- Single-image focal length estimation reaches state-of-the-art levels as a byproduct.
- Overall performance exceeds prior monocular depth methods across multiple accuracy dimensions.
Where Pith is reading between the lines
- Single-image depth systems could now be deployed in settings where camera calibration data is unavailable or unreliable.
- The emphasis on boundary sharpness suggests the outputs may integrate more cleanly into downstream 3D reconstruction pipelines.
- Hybrid real-synthetic training may generalize to other dense prediction tasks that require both metric consistency and fine detail preservation.
Load-bearing premise
The training protocol that mixes real and synthetic datasets together with the multi-scale vision transformer succeeds at delivering both accurate absolute scale and sharp boundaries in zero-shot settings without camera intrinsics.
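To make that premise concrete, here is a hedged sketch of one common pinhole-style parameterization (an assumption for illustration, not necessarily the paper's exact formulation): the network predicts scale-free canonical inverse depth, and metric depth is recovered by rescaling with the estimated focal length, so relative focal error passes straight through to depth.

```python
import numpy as np

def metric_depth(canonical_inv_depth, f_px, width_px):
    """Rescale canonical (scale-free) inverse depth to metric depth under
    the pinhole-style convention D = f_px / (w * C). The convention is an
    illustrative assumption, not taken from the paper."""
    return f_px / (width_px * np.maximum(canonical_inv_depth, 1e-6))

C = np.full((480, 640), 0.002)                        # dummy network output
d_true_f = metric_depth(C, f_px=600.0, width_px=640)  # true focal length
d_est_f = metric_depth(C, f_px=540.0, width_px=640)   # estimate 10% too low
print(np.mean(np.abs(d_est_f - d_true_f) / d_true_f))  # ~0.10
```

Because depth scales linearly with the focal estimate in this convention, a focal length that is 10% off makes every metric depth value 10% off, which is exactly why this premise is load-bearing.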
What would settle it
A collection of real-world images with independently measured ground-truth metric depths and focal lengths where Depth Pro produces large scale errors or visibly blurred object boundaries when run without any camera metadata.
Original abstract
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Depth Pro, a foundation model for zero-shot metric monocular depth estimation. It claims to synthesize high-resolution depth maps with high sharpness and fine details, providing absolute metric scale from a single image without camera intrinsics or metadata. The model runs in 0.3 seconds for 2.25-megapixel outputs on standard GPUs. Key contributions include an efficient multi-scale vision transformer for dense prediction, a training protocol mixing real and synthetic data for metric accuracy and boundary precision, new dedicated metrics for boundary accuracy, and state-of-the-art single-image focal length estimation. Extensive experiments are said to demonstrate outperformance over prior work along multiple dimensions, with code and weights released.
Significance. If the central claims hold, the work would be significant for computer vision applications requiring fast, high-quality metric depth from monocular images without calibration data, such as robotics, AR, and 3D reconstruction. The speed, resolution, and zero-shot metric capability without intrinsics represent a practical advance over prior methods. The open release of code and weights is a clear strength for reproducibility and further research.
major comments (2)
- [Experiments (focal length and metric depth evaluation)] The metric-scale claim without camera intrinsics rests on the single-image focal length estimator (described as SOTA in the abstract). No section isolates focal-length prediction error on diverse real-world test sets or quantifies its propagation into absolute depth metrics; if the relative focal-length MAE exceeds roughly 10-15% on out-of-distribution scenes, the reported metric-depth gains would be undermined.
- [Method and Experiments sections] The training protocol combining real and synthetic datasets is asserted to achieve both high metric accuracy and fine boundary tracing in zero-shot settings, but the manuscript provides no ablation that separates the contribution of the multi-scale vision transformer from that of the data mixture or the new boundary metrics (an illustrative sketch of such a boundary metric follows these comments).
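As context for the boundary-metric discussion above, here is what an edge-based boundary score for depth maps can look like. This is a generic illustration of the genre, not the paper's actual metric definition; the threshold and tolerance are assumed values.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def depth_edges(depth, rel_thresh=0.05):
    """Flag pixels whose depth jumps by more than rel_thresh relative to the
    local value; a crude occluding-boundary detector (illustrative)."""
    gy = np.abs(np.diff(depth, axis=0, append=depth[-1:, :]))
    gx = np.abs(np.diff(depth, axis=1, append=depth[:, -1:]))
    return np.maximum(gx, gy) > rel_thresh * np.maximum(depth, 1e-6)

def boundary_f1(pred, gt, tol_px=1, rel_thresh=0.05):
    """F1 between predicted and ground-truth depth edges, counting an edge
    pixel as matched if one lies within tol_px of it in the other map."""
    pe, ge = depth_edges(pred, rel_thresh), depth_edges(gt, rel_thresh)
    win = np.ones((2 * tol_px + 1,) * 2, dtype=bool)
    precision = (pe & binary_dilation(ge, win)).sum() / max(pe.sum(), 1)
    recall = (ge & binary_dilation(pe, win)).sum() / max(ge.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

A score of this kind rewards crisp, well-localized discontinuities, the failure mode that pixel-averaged depth metrics tend to miss when predictions are accurate but blurred.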
minor comments (1)
- [Abstract] The abstract states outperformance 'along multiple dimensions' but does not preview any quantitative numbers, error bars, or dataset names; this should be added for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the focal length evaluation and ablation studies. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Experiments (focal length and metric depth evaluation)] The metric-scale claim without camera intrinsics rests on the single-image focal length estimator (described as SOTA in the abstract). No section isolates focal-length prediction error on diverse real-world test sets or quantifies its propagation into absolute depth metrics; if the relative focal-length MAE exceeds roughly 10-15% on out-of-distribution scenes, the reported metric-depth gains would be undermined.
Authors: We agree that isolating the focal length estimator's accuracy and its effect on metric depth is important for validating the zero-shot claims. The current manuscript reports overall depth metrics and states SOTA focal length performance, but does not include a dedicated breakdown. In the revision, we will add a new subsection in the Experiments section reporting focal length MAE and relative error on multiple real-world datasets (NYU, KITTI, ETH3D, and others) and will quantify propagation by recomputing depth metrics using predicted versus ground-truth focal lengths where available. This will directly address potential error accumulation in out-of-distribution scenes. revision: yes
- Referee: [Method and Experiments sections] The training protocol combining real and synthetic datasets is asserted to achieve both high metric accuracy and fine boundary tracing in zero-shot settings, but the manuscript provides no ablation that separates the contribution of the multi-scale vision transformer from that of the data mixture or the new boundary metrics.
Authors: The manuscript includes targeted experiments on design choices and overall performance, but we acknowledge that it lacks fully disentangled ablations separating the multi-scale ViT architecture, the real+synthetic data mixture, and the boundary-specific losses/metrics. In the revised version, we will expand the Experiments section with additional ablation tables that train and evaluate controlled variants (e.g., single-scale vs. multi-scale, real-only vs. mixed data, with vs. without boundary terms) to clearly attribute the gains in metric accuracy and boundary precision to each component. revision: yes
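The controlled variants proposed in this response form a small factorial grid; the sketch below just enumerates it (variant names are hypothetical labels mirroring the response, not configurations from the paper).

```python
from itertools import product

# Hypothetical 2x2x2 ablation grid mirroring the rebuttal's proposal:
# architecture x training data x boundary supervision.
architectures = ["single-scale ViT", "multi-scale ViT"]
data_mixes = ["real only", "real + synthetic"]
boundary_terms = ["no boundary terms", "boundary terms"]

for i, variant in enumerate(product(architectures, data_mixes, boundary_terms)):
    # Each variant trains under an identical schedule and is scored on
    # held-out benchmarks with both metric-depth and boundary metrics.
    print(f"variant {i}:", " | ".join(variant))
```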
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper trains a multi-scale vision transformer on external real and synthetic datasets to produce metric depth maps, with focal length estimation presented as an auxiliary SOTA component rather than a self-referential fit. No equation or claim reduces a prediction to its own inputs by construction, nor does any load-bearing step rely on a self-citation chain that itself lacks independent verification. Evaluation uses newly proposed boundary metrics on held-out benchmarks, keeping the central zero-shot metric claim externally falsifiable and independent of the model's fitted parameters.
Forward citations
Cited by 22 Pith papers
- Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
  AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
  OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures
  LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
  A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- Globally Optimal Pose from Orthographic Silhouettes
  A search-based algorithm achieves globally optimal pose estimation from silhouettes alone by querying precomputed area response surfaces and auxiliary ellipse aspect ratios for any shape.
- 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
  3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
- Training a Student Expert via Semi-Supervised Foundation Model Distillation
  A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
- HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
  HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
- Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection
  MS-DePro achieves state-of-the-art performance on multi-source domain adaptation benchmarks for object detection by using depth-guided region proposals and multi-modal alignment of learnable text embeddings.
- A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline
  Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.
- GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
  GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
- Target-depth sensing with metasurface-encoder integrated optoelectronic neural network
  A metasurface optical encoder compresses depth into 2D images for a shadow ResNet to achieve high accuracy in both target classification and depth estimation on MNIST and vehicle datasets.
- MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
  MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...
- Image Generators are Generalist Vision Learners
  Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
- Image Generators are Generalist Vision Learners
  Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
- In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting
  A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
- Depth Anything 3: Recovering the Visual Space from Any Views
  DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
- The Midas Touch for Metric Depth
  MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
- Sapiens2
  Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
- Qwen-Image Technical Report
  Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
- Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama
  A feed-forward Gaussian-splatting system reconstructs photo-realistic 3D scenes from single-view panoramas in seconds via cube-map decomposition and depth-aware fusion for robotic simulation use.
- Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
  Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.