ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

2023 · cs.CV · arXiv 2302.12288

Canonical reference. 70% of citing Pith papers cite this work as background.

54 Pith papers citing it

Background 70% of classified citations

abstract

This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance disregarding metric scale, i.e. relative depth estimation, or state-of-the-art results on specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth.
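
The abstract reports gains in relative absolute error (REL). As a rough illustration only, the metric is commonly computed as the mean of |prediction - ground truth| / ground truth over pixels with valid ground truth. The sketch below assumes that common definition, assumes depths arrive as flat lists, and treats non-positive ground truth as invalid. The function names and all numbers are hypothetical and are not taken from the paper or its code.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels whose ground truth is positive.

    Assumption: non-positive ground-truth values mark invalid pixels.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_improvement(old_err, new_err):
    """Fractional reduction of an error metric, e.g. 0.2 means 20% lower."""
    return (old_err - new_err) / old_err


# Illustrative depths in metres; the third pixel has no valid ground truth.
pred = [2.0, 4.0, 3.0]
gt = [1.0, 5.0, 0.0]
print(abs_rel(pred, gt))

# Illustrative error values only, not results from the paper.
print(relative_improvement(0.10, 0.08))
```

A percentage such as the 21% quoted above would, under this reading, be a relative reduction of REL against the previous best value rather than an absolute difference.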

citation-role summary

background: 7 · method: 2 · baseline: 1

citation-polarity summary

background: 7 · use method: 2 · baseline: 1

representative citing papers

Honey, I Shrunk the Arc de Triomphe!

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

SeeGroup formulates per-pixel multi-layer depth as a point process with permutation-invariant likelihood to support arbitrary groupings, raising quadruplet relative depth accuracy from 61.34% to 70.09% on the LayeredDepth benchmark.

H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.

DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

cs.CV · 2026-03-25 · unverdicted · novelty 7.0

EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

cs.CV · 2026-03-19 · unverdicted · novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.

RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

cs.CV · 2026-02-10 · unverdicted · novelty 7.0

RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes while staying competitive on standard benchmarks.

Bridging Visual and Wireless Sensing via a Unified Radiation Field for 3D Radio Map Construction

cs.NI · 2026-01-27 · unverdicted · novelty 7.0

URF-GS creates a single radiation field from visual and wireless observations via 3D Gaussian splatting to predict radio signals at any location and configuration with higher accuracy and fewer samples than prior NeRF approaches.

Geometry-Aware Cross Modal Alignment for Light Field-LiDAR Semantic Segmentation

cs.CV · 2025-10-08 · unverdicted · novelty 7.0

Proposes the first light field-LiDAR semantic segmentation dataset and the Mlpfseg network, which improves mIoU by 1.71 over image-only and 2.38 over point-cloud-only baselines via feature completion and depth perception modules.

Materialist: Physically Based Editing Using Single-Image Inverse Rendering

cs.CV · 2025-01-07 · unverdicted · novelty 7.0

Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.

3D-VLA: A 3D Vision-Language-Action Generative World Model

cs.CV · 2024-03-14 · unverdicted · novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

Enabling Extensible Embodied Capabilities with Tools

cs.RO · 2026-05-26 · unverdicted · novelty 6.0

Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

PaGeR is a framework that lifts perspective 3D foundation models to omnidirectional images through mixed training, enabling unified prediction of scale-invariant depth, metric depth, surface normals, and sky masks from single panoramas.

UfM*: Uncertainty from Motion* for DNN Depth Estimation Using Gaussians

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

UfM* uses Gaussian mixtures to compute multiview disagreement for uncertainty in depth estimation with single inference per image, reducing energy and memory use.

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.

Unlocking Dense Metric Depth Estimation in VLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.

Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.

SS3D: End2End Self-Supervised 3D from Web Videos

cs.CV · 2026-04-24 · unverdicted · novelty 6.0 · 3 refs

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.

In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.

citing papers explorer

Showing 50 of 54 citing papers.

Honey, I Shrunk the Arc de Triomphe! cs.CV · 2026-06-01 · unverdicted · none · ref 2 · internal anchor
MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.
SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping cs.CV · 2026-05-27 · unverdicted · none · ref 5 · internal anchor
SeeGroup formulates per-pixel multi-layer depth as a point process with permutation-invariant likelihood to support arbitrary groupings, raising quadruplet relative depth accuracy from 61.34% to 70.09% on the LayeredDepth benchmark.
H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning cs.CV · 2026-05-21 · unverdicted · none · ref 86 · internal anchor
H-Flow learns dense human scene flow from monocular video via joint pose and depth prediction in a multi-head transformer, using physics-inspired geometric and biomechanical priors for self-supervision, and introduces the DynAct4D synthetic benchmark.
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth cs.CV · 2026-05-19 · unverdicted · none · ref 32 · internal anchor
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World cs.CV · 2026-05-06 · unverdicted · none · ref 4 · internal anchor
LAMP tracks 3D human motion from moving multi-camera headsets by converting 2D detections to a unified metric 3D world frame via device localization and fitting with an end-to-end spatio-temporal transformer.
DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity cs.CV · 2026-05-03 · unverdicted · none · ref 12 · internal anchor
Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors cs.CV · 2026-04-14 · unverdicted · none · ref 2 · internal anchor
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation cs.CV · 2026-04-08 · unverdicted · none · ref 54 · internal anchor
LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction cs.CV · 2026-03-25 · unverdicted · none · ref 3 · internal anchor
EndoVGGT uses a dynamic DeGAT graph attention module to improve depth estimation and non-rigid 3D reconstruction in surgery, reporting 24.6% PSNR and 9.1% SSIM gains on SCARED with zero-shot generalization to new domains.
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation cs.CV · 2026-03-19 · unverdicted · none · ref 6 · internal anchor
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes cs.CV · 2026-02-10 · unverdicted · none · ref 4 · internal anchor
RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes while staying competitive on standard benchmarks.
Bridging Visual and Wireless Sensing via a Unified Radiation Field for 3D Radio Map Construction cs.NI · 2026-01-27 · unverdicted · none · ref 31 · internal anchor
URF-GS creates a single radiation field from visual and wireless observations via 3D Gaussian splatting to predict radio signals at any location and configuration with higher accuracy and fewer samples than prior NeRF approaches.
Geometry-Aware Cross Modal Alignment for Light Field-LiDAR Semantic Segmentation cs.CV · 2025-10-08 · unverdicted · none · ref 46 · internal anchor
Proposes the first light field-LiDAR semantic segmentation dataset and the Mlpfseg network, which improves mIoU by 1.71 over image-only and 2.38 over point-cloud-only baselines via feature completion and depth perception modules.
Materialist: Physically Based Editing Using Single-Image Inverse Rendering cs.CV · 2025-01-07 · unverdicted · none · ref 6 · internal anchor
Materialist performs single-image inverse rendering via neural-initialized progressive differentiable rendering to enable physically consistent material editing, object insertion, relighting, and transparency edits without full scene geometry.
3D-VLA: A 3D Vision-Language-Action Generative World Model cs.CV · 2024-03-14 · unverdicted · none · ref 2 · internal anchor
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
Enabling Extensible Embodied Capabilities with Tools cs.RO · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.
Unified Panoramic Geometry Estimation via Multi-View Foundation Models cs.CV · 2026-05-25 · unverdicted · none · ref 3 · internal anchor
PaGeR is a framework that lifts perspective 3D foundation models to omnidirectional images through mixed training, enabling unified prediction of scale-invariant depth, metric depth, surface normals, and sky masks from single panoramas.
UfM*: Uncertainty from Motion* for DNN Depth Estimation Using Gaussians cs.RO · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
UfM* uses Gaussian mixtures to compute multiview disagreement for uncertainty in depth estimation with single inference per image, reducing energy and memory use.
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling cs.CV · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
Unlocking Dense Metric Depth Estimation in VLMs cs.CV · 2026-05-15 · unverdicted · none · ref 5 · 2 links · internal anchor
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners cs.CV · 2026-04-29 · unverdicted · none · ref 40 · internal anchor
LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.
Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation cs.CV · 2026-04-29 · unverdicted · none · ref 4 · internal anchor
Layer analysis of DINOv3 shows non-uniform 3D geometric knowledge concentrated in deeper layers, enabling a last-layer-centric recombination module that improves monocular depth estimation accuracy to state-of-the-art levels.
SS3D: End2End Self-Supervised 3D from Web Videos cs.CV · 2026-04-24 · unverdicted · none · ref 4 · 3 links · internal anchor
SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting cs.CV · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas cs.CV · 2026-03-30 · unverdicted · none · ref 3 · internal anchor
Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection cs.CV · 2026-03-12 · unverdicted · none · ref 2 · internal anchor
R4Det fuses 4D radar and camera inputs via panoramic depth fusion, deformable gated temporal fusion without ego pose, and instance-guided refinement to reach state-of-the-art 3D detection on TJ4DRadSet and VoD.
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness cs.CV · 2026-02-22 · unverdicted · none · ref 3 · internal anchor
OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation models, delivering over 20% gains and 46-92% lower errors on KITTI, nuScenes, and A
GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure cs.CV · 2025-12-25 · unverdicted · none · ref 4 · internal anchor
GeCo is a new geometry-based metric that produces dense maps of motion and structure inconsistencies in video generation by fusing residual motion and depth priors.
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles cs.CV · 2025-12-03 · unverdicted · none · ref 1 · internal anchor
ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.
Depth Anything V2 cs.CV · 2024-06-13 · unverdicted · none · ref 6 · internal anchor
Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry cs.CV · 2024-06-06 · unverdicted · none · ref 2 · internal anchor
EpiS improves generalizable neural surface reconstruction from sparse views by guiding epipolar feature aggregation with cost volumes, using an epipolar transformer, and applying pretrained monocular depth constraints, outperforming prior methods on DTU and BlendedMVS.
VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching cs.CV · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction cs.RO · 2026-05-20 · unverdicted · none · ref 4 · internal anchor
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
Understanding Model Behavior in Monocular Polyp Sizing cs.CV · 2026-05-19 · accept · none · ref 3 · internal anchor
Monocular polyp sizing models achieve moderate performance by exploiting examination behavior cues rather than true metric scales, with scale information and segmentation robustness acting as independent bottlenecks.
Efficient 3D Content Reconstruction and Generation cs.CV · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping cs.RO · 2026-05-17 · unverdicted · none · ref 19 · internal anchor
Mono-Hydra++ is a monocular RGB-IMU pipeline that constructs hierarchical 3D scene graphs in real time while reporting lower trajectory error than some RGB-D baselines on indoor datasets.
Pose-Aware Diffusion for 3D Generation cs.CV · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation cs.CV · 2026-04-29 · unverdicted · none · ref 50 · internal anchor
A generative video synthesis pipeline paired with a semantic graph neural network yields gains in accident anticipation accuracy and lead time on driving datasets, accompanied by a new benchmark release.
Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan cs.CV · 2026-04-17 · conditional · none · ref 1 · internal anchor
A new wildlife-specific hazy image dataset and IncepDehazeGan model that reports state-of-the-art dehazing metrics and more than doubles downstream animal detection performance.
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction cs.CV · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.
Geometry-Aware Scene Configurations for Novel View Synthesis cs.CV · 2025-10-10 · unverdicted · none · ref 3 · internal anchor
Geometry-guided adaptive placement of bases and virtual viewpoints improves rendering quality and memory use over uniform arrangements in scalable NeRF for large indoor scenes.
ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving cs.CV · 2025-08-19 · unverdicted · none · ref 37 · internal anchor
ROVR is a new diverse depth dataset for autonomous driving with 200K frames, released pipelines, and ablations showing sparse ground truth supports model training.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details cs.CV · 2025-07-03 · unverdicted · none · ref 5 · internal anchor
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler cs.CV · 2025-02-27 · conditional · none · ref 39 · internal anchor
UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model cs.RO · 2025-01-27 · unverdicted · none · ref 4 · internal anchor
SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 million real-world episodes.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation cs.CV · 2025-01-05 · unverdicted · none · ref 11 · internal anchor
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion cs.CV · 2024-10-04 · unverdicted · none · ref 46 · internal anchor
By fine-tuning DUST3R to output per-timestep pointmaps on scarce dynamic video datasets, MonST3R achieves stronger video depth and pose estimation without explicit motion modeling.
Towards Robust Surgical Automation via Digital Twin Representations from Foundation Models cs.RO · 2024-09-19 · unverdicted · none · ref 48 · internal anchor
Digital twin representations from vision foundation models enable LLM-based planning for robust peg transfer and gauze retrieval on the dVRK surgical platform with claimed generalizability.
Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos cs.CV · 2026-05-26 · unverdicted · none · ref 2 · internal anchor
HTD-Refine uses a temporal transformer (PVA-Net) to predict high-order dynamics and refines HMR outputs via optimization for more natural motion.
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation cs.CV · 2026-05-10 · unverdicted · none · ref 3 · internal anchor
AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
