Introduces MultiDepth-3k benchmark revealing diverse layer preferences across depth models on ambiguous scenes, with Laplacian Visual Prompting altering outputs for some frozen models and best pair reaching 75.5% ML-SRA.
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth.
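The abstract reports its headline gain in relative absolute error (REL). As a rough illustration only, REL is commonly computed as the mean of |predicted depth - true depth| / true depth over pixels with valid ground truth. The sketch below assumes that definition; the function name `abs_rel_error`, the convention that non-positive ground truth marks a missing measurement, and the sample depths are all hypothetical and not taken from the ZoeDepth code.

```python
from typing import Sequence


def abs_rel_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Assumption: ground-truth depth <= 0 marks a missing measurement and is skipped.
    ratios = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not ratios:
        raise ValueError("no valid ground-truth pixels")
    return sum(ratios) / len(ratios)


# Hypothetical depths in metres for four pixels; the last has no ground truth.
pred_depth = [2.0, 4.5, 9.0, 3.0]
gt_depth = [2.0, 5.0, 10.0, 0.0]
rel = abs_rel_error(pred_depth, gt_depth)
print(f"REL = {rel:.4f}")
```

Lower REL is better. Because each error is divided by the true depth, a 0.5 m miss at 5 m counts the same as a 1 m miss at 10 m.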
representative citing papers
MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.
TROPHIES introduces a unified framework for human-scene-camera reconstruction from multi-view videos, achieving globally aligned and physically plausible 4D outputs on EgoHuman and EgoExo4D.
SeeGroup formulates per-pixel multi-layer depth as a point process with permutation-invariant likelihood to support arbitrary groupings, raising quadruplet relative depth accuracy from 61.34% to 70.09% on the LayeredDepth benchmark.
GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.
WideDepth supplies the first millimeter-accurate indoor fisheye depth benchmark together with a stereo generation pipeline and model adaptation technique.
H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes while staying competitive on standard benchmarks.
URF-GS creates a single radiation field from visual and wireless observations via 3D Gaussian splatting to predict radio signals at any location and configuration with higher accuracy and fewer samples than prior NeRF approaches.
Proposes the first light field-LiDAR semantic segmentation dataset and the Mlpfseg network, which improves mIoU by 1.71 over image-only and 2.38 over point-cloud-only baselines via feature completion and depth perception modules.
Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.
Leading VLMs show high cross-view consistency paired with low metric accuracy on distance queries, indicating evidence-insensitive reasoning rather than geometric grounding.
Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.
PaGeR is a framework that lifts perspective 3D foundation models to omnidirectional images through mixed training, enabling unified prediction of scale-invariant depth, metric depth, surface normals, and sky masks from single panoramas.
DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation while updating only 2% of parameters.
citing papers explorer
- Geometry-Aware Cross Modal Alignment for Light Field-LiDAR Semantic Segmentation
  Proposes the first light field-LiDAR semantic segmentation dataset and the Mlpfseg network, which improves mIoU by 1.71 over image-only and 2.38 over point-cloud-only baselines via feature completion and depth perception modules.
- Materialist: Physically Based Editing Using Single-Image Inverse Rendering
  Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.
- GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure
  GeCo is a new geometry-based metric that produces dense maps of motion and structure inconsistencies in video generation by fusing residual motion and depth priors.
- Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
  ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.
- Geometry-Aware Scene Configurations for Novel View Synthesis
  Geometry-guided adaptive placement of bases and virtual viewpoints improves rendering quality and memory use over uniform arrangements in scalable NeRF for large indoor scenes.
- ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
  ROVR is a new diverse depth dataset for autonomous driving with 200K frames, released pipelines, and ablations showing sparse ground truth supports model training.
- MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
  MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
- UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
  UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.
- SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
  SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 million real-world episodes.
- DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
  DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
- Step1X-Edit: A Practical Framework for General Image Editing
  Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.
- PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation