Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

· 2024 · cs.CV · arXiv 2410.02073

Contested. 1 Pith paper cites this work to dispute or refute its claims.

57 Pith papers citing it

abstract

We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro

citation-role summary

background 4 · baseline 2 · method 2

citation-polarity summary

background 3 · use method 2 · baseline 1 · contest 1 · unclear 1

representative citing papers

QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

QWERTY enables training-free motion control in pretrained image-to-video DiTs by warping the frame-invariant semantic subspace of queries in 3D full attention and using the predicted noise as self-guidance for latent optimization.

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

DepthMaster unifies metric monocular depth estimation for perspective and panoramic images by patching panoramas into perspective views, adding a consistency loss and virtual cameras, and training mostly on perspective data to reach SOTA zero-shot results on 13 datasets.

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

The authors create ReasonMatch-Bench and DCRL training to boost MLLM performance on wide-baseline matching, reporting gains over baselines while preserving general capabilities.

Honey, I Shrunk the Arc de Triomphe!

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

The MetricScenes dataset, built from web photos and stereo imagery, together with a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale collapse in metric monocular geometry while preserving benchmark performance.

WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

WideDepth supplies the first millimeter-accurate indoor fisheye depth benchmark together with a stereo generation pipeline and model adaptation technique.

H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.

LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.

Globally Optimal Pose from Orthographic Silhouettes

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

A search-based algorithm achieves globally optimal pose estimation from silhouettes alone by querying precomputed area response surfaces and auxiliary ellipse aspect ratios for any shape.

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI and Gen3DSR while keeping diffusion efficiency.

Training a Student Expert via Semi-Supervised Foundation Model Distillation

cs.CV · 2026-04-04 · conditional · novelty 7.0

A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.

Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation

cs.CV · 2026-03-02 · unverdicted · novelty 7.0

Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.

SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures

eess.IV · 2025-06-30 · unverdicted · novelty 7.0

Introduces the first publicly accessible native 4K resolution endoscopic video dataset for robotic-assisted minimally invasive procedures.

GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

cs.CV · 2025-03-17 · unverdicted · novelty 7.0

GuideDog supplies 22K egocentric image-description pairs from 46 countries and an 818-sample QA benchmark showing that current multimodal models still struggle with depth perception and BLV-specific guidance rules.

WaterGen: Decoupling Scene and Medium in Underwater Image Generation

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

WaterGen decouples scene generation from medium degradation in a two-stage latent diffusion process to produce controllable realistic underwater images that improve downstream restoration and segmentation.

ShotcreteDepth: A Bi-modal Dataset for Robust Robotic Depth Perception in Shotcrete Construction Environments

cs.RO · 2026-06-22 · unverdicted · novelty 6.0

Presents ShotcreteDepth, a dataset of 11,252 synchronized stereo RGB and LiDAR samples from construction sites with 220 annotations for depth tasks in harsh conditions.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

Modality Forcing for Scalable Spatial Generation

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

A city-blind 3D isovist prediction model trained on Manhattan and Paris yields city identity linearly decodable from temporal latents above single-frame baselines.

VLM3: Vision Language Models Are Native 3D Learners

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.

citing papers explorer

Target-depth sensing with metasurface-encoder integrated optoelectronic neural network

physics.optics · 2026-04-28 · unverdicted · none · ref 42 · internal anchor

A metasurface optical encoder compresses depth into 2D images for a shadow ResNet to achieve high accuracy in both target classification and depth estimation on MNIST and vehicle datasets.
