hub Canonical reference

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Canonical reference. 70% of citing Pith papers cite this work as background.

56 Pith papers citing it
abstract

This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth .
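The abstract reports its headline gain in terms of relative absolute error (REL). As a rough illustration, REL is commonly computed as the mean of |predicted depth - ground-truth depth| / ground-truth depth over pixels that have valid ground truth. The sketch below assumes that definition; the function name `abs_rel`, the rule that non-positive ground truth marks a missing pixel, and the toy depth values are all illustrative assumptions and are not taken from the ZoeDepth code or results.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Assumption: non-positive ground-truth values mark missing depth and are skipped.
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)


# Toy depths in metres (made-up values, not from the paper).
pred_depth = [2.0, 4.0, 9.0, 3.0]
gt_depth = [2.0, 5.0, 10.0, 0.0]  # last pixel has no ground truth
rel = abs_rel(pred_depth, gt_depth)
print(f"REL = {rel:.3f}")
```

Because each error is divided by the true depth, a 1 m mistake at 5 m counts more than a 1 m mistake at 10 m, which is why REL only makes sense for models that predict metric scale rather than relative depth alone.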

citation-role summary

background 7 · method 2 · baseline 1

representative citing papers

Honey, I Shrunk the Arc de Triomphe!

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

cs.CV · 2026-03-19 · unverdicted · novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.

Materialist: Physically Based Editing Using Single-Image Inverse Rendering

cs.CV · 2025-01-07 · unverdicted · novelty 7.0

Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

Enabling Extensible Embodied Capabilities with Tools

cs.RO · 2026-05-26 · unverdicted · novelty 6.0

Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

PaGeR is a framework that lifts perspective 3D foundation models to omnidirectional images through mixed training, enabling unified prediction of scale-invariant depth, metric depth, surface normals, and sky masks from single panoramas.

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation while updating only 2% of parameters.

Unlocking Dense Metric Depth Estimation in VLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • Depth Anything V2 cs.CV · 2024-06-13 · unverdicted · none · ref 6 · internal anchor

    Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.