Recognition: 2 theorem links · Lean Theorem
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3
The pith
Depth Pro produces sharp, metric-scale depth maps from single images in 0.3 seconds without any camera metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Depth Pro synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. These characteristics are enabled by an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image.
What carries the argument
An efficient multi-scale vision transformer for dense prediction paired with a mixed real-synthetic training protocol that jointly optimizes metric scale and boundary fidelity.
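For intuition, a minimal sketch of the multi-scale pattern follows: a shared ViT-style encoder is run on an image pyramid and the per-scale features are fused into one dense prediction. The two-scale pyramid, layer sizes, and fusion scheme here are illustrative assumptions, not Depth Pro's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleViTSketch(nn.Module):
    """Illustrative multi-scale dense predictor: a shared patch encoder is
    applied to an image pyramid, per-scale features are fused at the finest
    token grid, and a 1x1 head decodes a dense one-channel map (e.g.
    inverse depth). All sizes are toy values, not the paper's."""

    def __init__(self, patch=16, dim=64):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def encode(self, x):
        tokens = self.patch_embed(x)                 # (B, dim, H/p, W/p)
        b, c, h, w = tokens.shape
        seq = self.encoder(tokens.flatten(2).transpose(1, 2))
        return seq.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        fine = self.encode(x)
        coarse = self.encode(F.interpolate(x, scale_factor=0.5,
                                           mode="bilinear", align_corners=False))
        coarse = F.interpolate(coarse, size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        out = self.head(self.fuse(torch.cat([fine, coarse], dim=1)))
        return F.interpolate(out, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)

depth = MultiScaleViTSketch()(torch.randn(1, 3, 256, 256))
print(depth.shape)  # torch.Size([1, 1, 256, 256])
```

The appeal of this pattern is that each transformer pass runs on a modest fixed token grid, so fine detail and global context come from cheap separate passes rather than one prohibitively large attention window.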
If this is right
- The model generates 2.25-megapixel depth maps in 0.3 seconds on a standard GPU.
- Depth estimates remain metric and absolute without camera intrinsics or other metadata.
- Boundary accuracy improves measurably through the dedicated evaluation metrics introduced.
- Single-image focal length estimation reaches state-of-the-art levels as a byproduct.
- Overall performance exceeds prior monocular depth methods across multiple accuracy dimensions.
Where Pith is reading between the lines
- Single-image depth systems could now be deployed in settings where camera calibration data is unavailable or unreliable.
- The emphasis on boundary sharpness suggests the outputs may integrate more cleanly into downstream 3D reconstruction pipelines.
- Hybrid real-synthetic training may generalize to other dense prediction tasks that require both metric consistency and fine detail preservation.
Load-bearing premise
The training protocol that mixes real and synthetic datasets together with the multi-scale vision transformer succeeds at delivering both accurate absolute scale and sharp boundaries in zero-shot settings without camera intrinsics.
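To make that premise concrete, here is a hedged sketch of one common pinhole-style parameterization (an assumption for illustration, not necessarily the paper's exact formulation): the network predicts scale-free canonical inverse depth, and metric depth is recovered by rescaling with the estimated focal length, so relative focal error passes straight through to depth.

```python
import numpy as np

def metric_depth(canonical_inv_depth, f_px, width_px):
    """Rescale canonical (scale-free) inverse depth to metric depth under
    the pinhole-style convention D = f_px / (w * C). The convention is an
    illustrative assumption, not taken from the paper."""
    return f_px / (width_px * np.maximum(canonical_inv_depth, 1e-6))

C = np.full((480, 640), 0.002)                        # dummy network output
d_true_f = metric_depth(C, f_px=600.0, width_px=640)  # true focal length
d_est_f = metric_depth(C, f_px=540.0, width_px=640)   # estimate 10% too low
print(np.mean(np.abs(d_est_f - d_true_f) / d_true_f))  # ~0.10
```

Because depth scales linearly with the focal estimate in this convention, a focal length that is 10% off makes every metric depth value 10% off, which is exactly why this premise is load-bearing.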
What would settle it
A collection of real-world images with independently measured ground-truth metric depths and focal lengths where Depth Pro produces large scale errors or visibly blurred object boundaries when run without any camera metadata.
Original abstract
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Depth Pro, a foundation model for zero-shot metric monocular depth estimation. It claims to synthesize high-resolution depth maps with high sharpness and fine details, providing absolute metric scale from a single image without camera intrinsics or metadata. The model runs in 0.3 seconds for 2.25-megapixel outputs on standard GPUs. Key contributions include an efficient multi-scale vision transformer for dense prediction, a training protocol mixing real and synthetic data for metric accuracy and boundary precision, new dedicated metrics for boundary accuracy, and state-of-the-art single-image focal length estimation. Extensive experiments are said to demonstrate outperformance over prior work along multiple dimensions, with code and weights released.
Significance. If the central claims hold, the work would be significant for computer vision applications requiring fast, high-quality metric depth from monocular images without calibration data, such as robotics, AR, and 3D reconstruction. The speed, resolution, and zero-shot metric capability without intrinsics represent a practical advance over prior methods. The open release of code and weights is a clear strength for reproducibility and further research.
major comments (2)
- [Experiments (focal length and metric depth evaluation)] The metric-scale claim without camera intrinsics rests on the single-image focal length estimator (described as SOTA in the abstract). No section isolates focal-length prediction error on diverse real-world test sets or quantifies its propagation into absolute depth metrics; if the relative focal-length MAE exceeds roughly 10-15% on out-of-distribution scenes, the reported metric-depth gains would be undermined.
- [Method and Experiments sections] The training protocol combining real and synthetic datasets is asserted to achieve both high metric accuracy and fine boundary tracing in zero-shot settings, but the manuscript provides no ablation that separates the contribution of the multi-scale vision transformer from that of the data mixture or the new boundary metrics (an illustrative sketch of such a boundary metric follows these comments).
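As context for the boundary-metric discussion above, here is what an edge-based boundary score for depth maps can look like. This is a generic illustration of the genre, not the paper's actual metric definition; the threshold and tolerance are assumed values.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def depth_edges(depth, rel_thresh=0.05):
    """Flag pixels whose depth jumps by more than rel_thresh relative to the
    local value; a crude occluding-boundary detector (illustrative)."""
    gy = np.abs(np.diff(depth, axis=0, append=depth[-1:, :]))
    gx = np.abs(np.diff(depth, axis=1, append=depth[:, -1:]))
    return np.maximum(gx, gy) > rel_thresh * np.maximum(depth, 1e-6)

def boundary_f1(pred, gt, tol_px=1, rel_thresh=0.05):
    """F1 between predicted and ground-truth depth edges, counting an edge
    pixel as matched if one lies within tol_px of it in the other map."""
    pe, ge = depth_edges(pred, rel_thresh), depth_edges(gt, rel_thresh)
    win = np.ones((2 * tol_px + 1,) * 2, dtype=bool)
    precision = (pe & binary_dilation(ge, win)).sum() / max(pe.sum(), 1)
    recall = (ge & binary_dilation(pe, win)).sum() / max(ge.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

A score of this kind rewards crisp, well-localized discontinuities, the failure mode that pixel-averaged depth metrics tend to miss when predictions are accurate but blurred.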
minor comments (1)
- [Abstract] The abstract states outperformance 'along multiple dimensions' but does not preview any quantitative numbers, error bars, or dataset names; this should be added for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the focal length evaluation and ablation studies. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Experiments (focal length and metric depth evaluation)] The metric-scale claim without camera intrinsics rests on the single-image focal length estimator (described as SOTA in the abstract). No section isolates focal-length prediction error on diverse real-world test sets or quantifies its propagation into absolute depth metrics; if the relative focal-length MAE exceeds roughly 10-15% on out-of-distribution scenes, the reported metric-depth gains would be undermined.
Authors: We agree that isolating the focal length estimator's accuracy and its effect on metric depth is important for validating the zero-shot claims. The current manuscript reports overall depth metrics and states SOTA focal length performance, but does not include a dedicated breakdown. In the revision, we will add a new subsection in the Experiments section reporting focal length MAE and relative error on multiple real-world datasets (NYU, KITTI, ETH3D, and others) and will quantify propagation by recomputing depth metrics using predicted versus ground-truth focal lengths where available. This will directly address potential error accumulation in out-of-distribution scenes. revision: yes
- Referee: [Method and Experiments sections] The training protocol combining real and synthetic datasets is asserted to achieve both high metric accuracy and fine boundary tracing in zero-shot settings, but the manuscript provides no ablation that separates the contribution of the multi-scale vision transformer from that of the data mixture or the new boundary metrics.
Authors: The manuscript includes targeted experiments on design choices and overall performance, but we acknowledge that it lacks fully disentangled ablations separating the multi-scale ViT architecture, the real+synthetic data mixture, and the boundary-specific losses/metrics. In the revised version, we will expand the Experiments section with additional ablation tables that train and evaluate controlled variants (e.g., single-scale vs. multi-scale, real-only vs. mixed data, with vs. without boundary terms) to clearly attribute the gains in metric accuracy and boundary precision to each component. revision: yes
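The controlled variants proposed in this response form a small factorial grid; the sketch below just enumerates it (variant names are hypothetical labels mirroring the response, not configurations from the paper).

```python
from itertools import product

# Hypothetical 2x2x2 ablation grid mirroring the rebuttal's proposal:
# architecture x training data x boundary supervision.
architectures = ["single-scale ViT", "multi-scale ViT"]
data_mixes = ["real only", "real + synthetic"]
boundary_terms = ["no boundary terms", "boundary terms"]

for i, variant in enumerate(product(architectures, data_mixes, boundary_terms)):
    # Each variant trains under an identical schedule and is scored on
    # held-out benchmarks with both metric-depth and boundary metrics.
    print(f"variant {i}:", " | ".join(variant))
```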
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper trains a multi-scale vision transformer on external real and synthetic datasets to produce metric depth maps, with focal length estimation presented as an auxiliary SOTA component rather than a self-referential fit. No equation or claim reduces a prediction to its own inputs by construction, nor does any load-bearing step rely on a self-citation chain that itself lacks independent verification. Evaluation uses newly proposed boundary metrics on held-out benchmarks, keeping the central zero-shot metric claim externally falsifiable and independent of the model's fitted parameters.
Forward citations
Cited by 22 Pith papers
- Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
  AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
  OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures
  LAGRNet embeds learnable algebraic group, ring, and sheaf structures into a neural network to improve accuracy and generalization in monocular depth estimation.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
  A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- Globally Optimal Pose from Orthographic Silhouettes
  A search-based algorithm achieves globally optimal pose estimation from silhouettes alone by querying precomputed area response surfaces and auxiliary ellipse aspect ratios for any shape.
- 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
  3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
- Training a Student Expert via Semi-Supervised Foundation Model Distillation
  A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
- HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits
  HairOrbit leverages video generation priors and a neural orientation extractor to achieve state-of-the-art strand-level 3D hair reconstruction from single-view portraits in visible and invisible regions.
- Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection
  MS-DePro achieves state-of-the-art performance on multi-source domain adaptation benchmarks for object detection by using depth-guided region proposals and multi-modal alignment of learnable text embeddings.
- A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline
  Clear2Fog generates realistic synthetic fog from clear scenes, enabling mixed-density training that outperforms full fixed-density data and improves real-world performance by 1.67 mAP after learning-rate adjustment.
- GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
  GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
- Target-depth sensing with metasurface-encoder integrated optoelectronic neural network
  A metasurface optical encoder compresses depth into 2D images for a shadow ResNet to achieve high accuracy in both target classification and depth estimation on MNIST and vehicle datasets.
- MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
  MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...
- Image Generators are Generalist Vision Learners
  Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
- Image Generators are Generalist Vision Learners
  Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
- In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting
  A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
- Depth Anything 3: Recovering the Visual Space from Any Views
  DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
- The Midas Touch for Metric Depth
  MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
- Sapiens2
  Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
- Qwen-Image Technical Report
  Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
- Genie Sim PanoRecon: Fast Immersive Scene Generation from Single-View Panorama
  A feed-forward Gaussian-splatting system reconstructs photo-realistic 3D scenes from single-view panoramas in seconds via cube-map decomposition and depth-aware fusion for robotic simulation use.
- Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
  Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.