Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

2024 · cs.CV · arXiv 2410.02073

21 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro
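The abstract notes that the predictions are metric and that focal length is estimated from the image itself. One thing those two outputs allow is back-projecting each pixel to a 3D point in camera space. The sketch below is a minimal illustration under the standard pinhole camera model, not code from the Depth Pro repository: the function name `unproject`, the toy depth values, and the choice of the image center as principal point are all assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject(depth_m: List[List[float]], f_px: float) -> List[Point]:
    """Back-project a metric depth map (meters) to camera-space 3D points.

    Hypothetical helper: assumes a pinhole camera with square pixels and
    the principal point at the image center.
    """
    h = len(depth_m)
    w = len(depth_m[0])
    cx = (w - 1) / 2.0  # assumed principal point (x)
    cy = (h - 1) / 2.0  # assumed principal point (y)
    points: List[Point] = []
    for v in range(h):
        for u in range(w):
            z = depth_m[v][u]
            # Pinhole model: X = (u - cx) * Z / f, Y = (v - cy) * Z / f
            x = (u - cx) * z / f_px
            y = (v - cy) * z / f_px
            points.append((x, y, z))
    return points


# Toy 3x3 depth map in meters (made-up values, row-major order).
depth = [
    [2.0, 2.0, 2.0],
    [2.0, 4.0, 2.0],
    [2.0, 2.0, 2.0],
]
pts = unproject(depth, f_px=2.0)
print(pts[4])  # center pixel
print(pts[0])  # top-left pixel
```

With a real model, `depth_m` and `f_px` would come from the predicted depth map and the estimated focal length in pixels; because the depth carries absolute scale, the resulting points are in meters with no per-image rescaling.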

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.

LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.

Globally Optimal Pose from Orthographic Silhouettes

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A search-based algorithm achieves globally optimal pose estimation from silhouettes alone by querying precomputed area response surfaces and auxiliary ellipse aspect ratios for any shape.

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI and Gen3DSR while keeping diffusion efficiency.

Training a Student Expert via Semi-Supervised Foundation Model Distillation

cs.CV · 2026-04-04 · conditional · novelty 7.0

A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

MS-DePro achieves state-of-the-art performance on multi-source domain adaptation benchmarks for object detection by using depth-guided region proposals and multi-modal alignment of learnable text embeddings.

A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.

GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.

Target-depth sensing with metasurface-encoder integrated optoelectronic neural network

physics.optics · 2026-04-28 · unverdicted · novelty 6.0

A metasurface optical encoder compresses depth into 2D images for a shadow ResNet to achieve high accuracy in both target classification and depth estimation on MNIST and vehicle datasets.

MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KITTI-2012.

In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.

Depth Anything 3: Recovering the Visual Space from Any Views

cs.CV · 2025-11-13 · unverdicted · novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

The Midas Touch for Metric Depth

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

Sapiens2

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.

Qwen-Image Technical Report

cs.CV · 2025-08-04 · unverdicted · novelty 5.0

Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.

Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama

cs.RO · 2026-04-08 · unverdicted · novelty 4.0

A feed-forward Gaussian-splatting system reconstructs photo-realistic 3D scenes from single-view panoramas in seconds via cube-map decomposition and depth-aware fusion for robotic simulation use.

Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation

cs.CV · 2026-04-24 · unverdicted · novelty 3.0

Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22

citing papers explorer

Showing 21 of 21 citing papers.

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction · cs.CV · 2026-05-12 · unverdicted · none · ref 93 · internal anchor
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation · cs.RO · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures · cs.CV · 2026-04-27 · unverdicted · none · ref 20 · internal anchor
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation · cs.CV · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
Globally Optimal Pose from Orthographic Silhouettes · cs.CV · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image · cs.CV · 2026-04-06 · unverdicted · none · ref 3 · internal anchor
Training a Student Expert via Semi-Supervised Foundation Model Distillation · cs.CV · 2026-04-04 · conditional · none · ref 4 · internal anchor
HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits · cs.CV · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection · cs.CV · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline · cs.CV · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction · cs.CV · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
Target-depth sensing with metasurface-encoder integrated optoelectronic neural network · physics.optics · 2026-04-28 · unverdicted · none · ref 42 · internal anchor
MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement · cs.CV · 2026-04-22 · unverdicted · none · ref 42 · internal anchor
In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting · cs.CV · 2026-04-07 · unverdicted · none · ref 4 · internal anchor
Depth Anything 3: Recovering the Visual Space from Any Views · cs.CV · 2025-11-13 · unverdicted · none · ref 6 · internal anchor
The Midas Touch for Metric Depth · cs.CV · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
Sapiens2 · cs.CV · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
Qwen-Image Technical Report · cs.CV · 2025-08-04 · unverdicted · none · ref 3 · internal anchor
Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama · cs.RO · 2026-04-08 · unverdicted · none · ref 25 · internal anchor
Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation · cs.CV · 2026-04-24 · unverdicted · none · ref 6 · internal anchor
Image Generators are Generalist Vision Learners · cs.CV · 2026-04-22 · unreviewed · ref 3 · internal anchor
