arxiv: 2501.02576 · v2 · submitted 2025-01-05 · 💻 cs.CV

DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

Pith reviewed 2026-05-23 06:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · diffusion models · single-step denoising · feature alignment · Fourier enhancement · two-stage training · generalization

The pith

Single-step diffusion models can match multi-step accuracy in monocular depth estimation once generative features are aligned with discriminative needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the speed-accuracy tradeoff in diffusion-based monocular depth estimation. Standard diffusion needs many denoising steps for good results, but recent single-step versions lose either generalization or fine detail because the features learned for image generation do not match what depth prediction requires. DepthMaster closes that gap with two modules and a two-stage training schedule: one module injects semantic features to stop overfitting to textures, the other uses Fourier transforms to restore missing high-frequency details while keeping global structure. The result is a model that runs at single-step speed yet reports better generalization and detail preservation than prior diffusion methods on multiple datasets.

Core claim

DepthMaster is a single-step deterministic diffusion model that adapts generative features for the discriminative task of monocular depth estimation. A Feature Alignment module incorporates high-quality semantic features to strengthen the denoising network and reduce overfitting to texture details. A Fourier Enhancement module adaptively balances low-frequency scene structure against high-frequency details. Two-stage training first emphasizes global structure via the alignment module, then refines visual quality via the Fourier module, yielding state-of-the-art generalization and detail preservation that surpasses other diffusion-based depth estimators across datasets.
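
To make the frequency-balancing idea concrete, here is a minimal NumPy sketch. It is not the paper's module. It splits a 2D map into low- and high-frequency parts with a radial mask in the Fourier domain and recombines them with fixed weights. The function names, the `cutoff` value, and the fixed weights are assumptions chosen for illustration; the paper describes its balancing as adaptive, which this sketch does not attempt.

```python
import numpy as np


def split_frequencies(x, cutoff=0.25):
    """Split a 2D array into low- and high-frequency parts.

    `cutoff` is a hypothetical radius in cycles per sample; it is an
    illustration parameter, not a value taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(x)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def rebalance(x, low_weight=1.0, high_weight=1.5, cutoff=0.25):
    """Recombine the two bands with fixed (assumed) weights."""
    low, high = split_frequencies(x, cutoff)
    return low_weight * low + high_weight * high
```

Setting both weights to 1.0 returns the input unchanged, which is a quick sanity check that the split loses no information.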

What carries the argument

Feature Alignment module (injects semantic features into the denoising network) paired with Fourier Enhancement module (balances frequency content), trained in two sequential stages.
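
The summary does not state which loss ties the denoiser's features to the external semantic features. One common choice for this kind of alignment is a cosine-similarity penalty, sketched below under that assumption. The function name is hypothetical, and the sketch assumes both feature sets have already been projected to the same shape `(N, C)`.

```python
import numpy as np


def cosine_alignment_loss(denoiser_feats, semantic_feats, eps=1e-8):
    """Return 1 minus the mean cosine similarity over N feature vectors.

    Assumption: a cosine penalty stands in for whatever alignment
    objective the paper actually uses. Inputs have shape (N, C).
    """
    a = np.asarray(denoiser_feats, dtype=float)
    b = np.asarray(semantic_feats, dtype=float)
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(1.0 - (a * b).sum(axis=-1).mean())
```

The loss is 0 when the two feature sets point in the same directions and 2 when they point in opposite directions; it ignores feature magnitude by construction.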

If this is right

  • Single denoising step becomes sufficient for competitive depth maps instead of requiring dozens of steps.
  • Depth estimates improve on unseen datasets without per-dataset retraining or fine-tuning.
  • Fine details in depth maps are recovered without sacrificing global scene consistency.
  • The same single-step diffusion backbone can be reused for other dense prediction tasks once the alignment and frequency modules are added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage training pattern may serve as a general recipe for converting generative diffusion backbones into fast discriminative predictors.
  • If the feature mismatch is the dominant issue, similar alignment techniques could improve single-step diffusion for related tasks such as surface normal estimation or semantic segmentation.
  • The Fourier module's explicit frequency control suggests a route to diagnose and correct other failure modes where diffusion models lose high-frequency information under aggressive step reduction.
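
As a rough illustration of the diagnostic idea in the last point, one could measure how much of a predicted depth map's spectral energy lies above a chosen frequency and compare that number across step counts. The sketch below is an editorial assumption, not something the paper reports; the function name and the `cutoff` radius are hypothetical.

```python
import numpy as np


def high_freq_energy_fraction(depth, cutoff=0.25):
    """Fraction of non-DC spectral energy above `cutoff` (cycles per sample)."""
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Remove the mean so the constant offset does not dominate the total.
    power = np.abs(np.fft.fft2(depth - depth.mean())) ** 2
    total = power.sum()
    if total == 0:
        return 0.0  # a constant map has no structure at any frequency
    return float(power[radius > cutoff].sum() / total)
```

A prediction whose fraction falls well below that of the ground-truth map on the same scene would be a candidate case of lost fine detail, though the threshold for "well below" would need to be set empirically.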

Load-bearing premise

The performance gap in single-step diffusion depth estimation is caused mainly by a mismatch between generative and discriminative features that these two modules can close without introducing overfitting or detail loss.

What would settle it

A controlled test in which removing either the Feature Alignment or Fourier Enhancement module produces no measurable drop in accuracy or detail on held-out datasets would falsify the claim that these modules are the effective fix.
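
One way to decide whether a drop is "measurable" is a paired comparison of per-image errors between the full model and an ablated variant on the same held-out images. The sketch below uses a bootstrap interval on the mean difference; the function name, the number of resamples, and the 95% level are assumptions, and the paper may evaluate its ablations differently.

```python
import numpy as np


def paired_drop(full_err, ablated_err, n_boot=2000, seed=0):
    """Mean per-image error increase after ablation, with a 95% bootstrap interval.

    Both inputs are per-image errors on the same images, in the same order.
    A positive mean indicates the ablated model is worse.
    """
    full_err = np.asarray(full_err, dtype=float)
    ablated_err = np.asarray(ablated_err, dtype=float)
    diff = ablated_err - full_err

    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    return float(diff.mean()), (float(lower), float(upper))
```

If the interval comfortably contains zero for both modules on held-out data, that would count against the claim that the modules are the effective fix.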

Figures

Figures reproduced from arXiv: 2501.02576 by Bo Li, Hao Zhang, Li Liu, Peng-Tao Jiang, Ruijie Zhu, Tianzhu Zhang, Zerong Wang, Ziyang Song.


Figure 1: Visualization of different paradigms. “Denoise” refers to predicting depth in a diffusion-denoising way. Limited by the feature representation capability ...
Figure 2: The overall framework of DepthMaster. RGB is first projected into the latent space by the I2L Encoder to obtain ...
Figure 3: Qualitative comparison with zero-shot monocular depth estimation methods across different datasets. Our model demonstrates excellent detail ...
Figure 4: Qualitative results on in-the-wild examples. Our model not only recovers correct scene structure, but also exhibits fine-grained details.
Figure 5: Depth distribution of different depth preprocess methods on Virtual ...
Figure 6: Visualization of predictions from two stages. With the Fourier ...
Original abstract

Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DepthMaster, a single-step deterministic diffusion model for monocular depth estimation. It proposes a Feature Alignment module that incorporates high-quality semantic features to mitigate overfitting to texture details from generative features, and a Fourier Enhancement module that adaptively balances low-frequency structure and high-frequency details. A two-stage training strategy is employed (first focusing on global structure via Feature Alignment, then on visual quality via Fourier Enhancement). The central claim is that these components close the generative-discriminative feature gap, yielding state-of-the-art generalization and detail preservation that outperforms other diffusion-based methods across various datasets.

Significance. If the reported experiments hold, the work would be significant for adapting diffusion models to discriminative tasks under single-step inference constraints. The modular design (Feature Alignment + Fourier Enhancement) and staged training provide a concrete, testable approach to the generative-discriminative mismatch that prior single-step methods overlook, with potential impact on efficient depth estimation pipelines.

major comments (1)
  1. [Abstract] The assertion of 'state-of-the-art performance in terms of generalization and detail preservation' and 'outperforming other diffusion-based methods across various datasets' is presented with no quantitative metrics, baselines, error analysis, dataset names, or table references. This is load-bearing for the central claim and creates a verification gap that must be addressed with explicit results.
minor comments (2)
  1. The description of the two-stage training strategy would benefit from explicit loss formulations or pseudocode to clarify how the modules are activated or frozen across stages.
  2. Notation for the Fourier Enhancement module (e.g., frequency-domain operations) should be defined with equations rather than prose alone to ensure reproducibility.
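
As an indication of the kind of pseudocode the first minor comment asks for, a two-stage schedule could be written down as below. This is purely illustrative: the abstract does not say which components are frozen, whether the alignment term stays active in the second stage, or how the losses are weighted. Every name and value here (`StageConfig`, `STAGES`, `total_loss`, `align_weight`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    use_alignment_loss: bool
    use_fourier_module: bool


# Assumed schedule: stage 1 trains with the alignment term and no Fourier
# module; stage 2 switches the Fourier module on. Whether alignment is
# dropped in stage 2 is a guess, not something the abstract states.
STAGES = {
    1: StageConfig(use_alignment_loss=True, use_fourier_module=False),
    2: StageConfig(use_alignment_loss=False, use_fourier_module=True),
}


def total_loss(stage, depth_loss, align_loss=0.0, align_weight=1.0):
    """Combine loss terms according to the (assumed) stage configuration."""
    cfg = STAGES[stage]
    loss = depth_loss
    if cfg.use_alignment_loss:
        loss += align_weight * align_loss
    return loss
```

A table of this form in the paper, with the real flags and weights filled in, would resolve the ambiguity the comment points at.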

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that the central claims require explicit support and will revise the abstract to address the verification gap while preserving conciseness.

Point-by-point responses
  1. Referee: [Abstract] The assertion of 'state-of-the-art performance in terms of generalization and detail preservation' and 'outperforming other diffusion-based methods across various datasets' is presented with no quantitative metrics, baselines, error analysis, dataset names, or table references. This is load-bearing for the central claim and creates a verification gap that must be addressed with explicit results.

    Authors: We agree that the abstract would be strengthened by including explicit quantitative support. In the revised version we will add concise references to key metrics (e.g., AbsRel, RMSE on NYU Depth V2, KITTI, and ETH3D), the main baselines, and the corresponding result tables/figures. This directly addresses the verification gap without expanding the abstract beyond typical length constraints.

    Revision: yes
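
For readers unfamiliar with the two metrics named in the response, their standard definitions are short enough to state in code. The sketch below assumes predictions and ground truth are already in the same scale and that pixels with non-positive ground truth are invalid; both are common conventions rather than details taken from the paper, and the function names are ours.

```python
import numpy as np


def _valid(pred, gt, mask):
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Assumption: pixels with gt <= 0 carry no valid depth.
    mask = (gt > 0) if mask is None else np.asarray(mask, dtype=bool)
    return pred[mask], gt[mask]


def abs_rel(pred, gt, mask=None):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    p, g = _valid(pred, gt, mask)
    return float(np.mean(np.abs(p - g) / g))


def rmse(pred, gt, mask=None):
    """Root mean squared error over valid pixels."""
    p, g = _valid(pred, gt, mask)
    return float(np.sqrt(np.mean((p - g) ** 2)))
```

Lower is better for both. AbsRel weights errors relative to distance, so it penalizes near-range mistakes more than RMSE does.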

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces two new modules (Feature Alignment, Fourier Enhancement) plus a two-stage training procedure as design choices to adapt generative features for discriminative depth estimation. No equations, uniqueness theorems, or first-principles derivations are claimed; performance claims rest on the proposed architecture and reported experiments rather than any reduction to fitted parameters or self-citation chains. The provided abstract and skeptic analysis confirm the argument is internally consistent without the enumerated circularity patterns (self-definitional, fitted-input prediction, load-bearing self-citation, etc.).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, datasets, or implementation details; no free parameters, axioms, or invented entities can be identified or audited.

pith-pipeline@v0.9.0 · 5778 in / 1171 out tokens · 38499 ms · 2026-05-23T06:05:52.149573+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Midas Touch for Metric Depth

    cs.CV 2026-05 unverdicted novelty 5.0

    MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Mgnet: Monocular geo- metric scene understanding for autonomous driving,

    M. Sch ¨on, M. Buchholz, and K. Dietmayer, “Mgnet: Monocular geo- metric scene understanding for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2021, pp. 15 804–15 815

  2. [2]

    Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,

    Y . Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453

  3. [3]

    Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving,

    Y . You, Y . Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariha- ran, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving,” arXiv preprint arXiv:1906.06310, 2019

  4. [4]

    Ro- bodepth: Robust out-of-distribution depth estimation under corruptions,

    L. Kong, S. Xie, H. Hu, B. Cottereau, L. X. Ng, and W. T. Ooi, “Ro- bodepth: Robust out-of-distribution depth estimation under corruptions,” arXiv preprint arXiv:23xx.xxxxx , 2023

  5. [5]

    Consistent video depth estimation,

    X. Luo, J.-B. Huang, R. Szeliski, K. Matzen, and J. Kopf, “Consistent video depth estimation,” ACM Transactions on Graphics (ToG), vol. 39, no. 4, pp. 71–1, 2020

  6. [6]

    Low power depth estimation of rigid objects for time-of-flight imaging,

    J. Noraky and V . Sze, “Low power depth estimation of rigid objects for time-of-flight imaging,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1524–1534, 2019

  7. [7]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 3836–3847

  8. [8]

    Structure and content-guided video synthesis with diffusion models,

    P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356

  9. [9]

    Depth anything: Unleashing the power of large-scale unlabeled data,

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 371–10 381

  10. [10]

    Vision transformers for dense prediction,

    R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2021, pp. 12 179–12 188

  11. [11]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M ¨uller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, 2023

  12. [12]

    Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,

    A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 786–10 796

  13. [13]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,

    R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 3, pp. 1623–1637, 2020

  14. [14]

    Diversedepth: Affine-invariant depth prediction using diverse data,

    W. Yin, X. Wang, C. Shen, Y . Liu, Z. Tian, S. Xu, C. Sun, and D. Renyin, “Diversedepth: Affine-invariant depth prediction using diverse data,” arXiv preprint arXiv:2002.00569 , 2020

  15. [15]

    Scaledepth: Decomposing metric depth estimation into scale prediction and relative depth estimation,

    R. Zhu, C. Wang, Z. Song, L. Liu, T. Zhang, and Y . Zhang, “Scaledepth: Decomposing metric depth estimation into scale prediction and relative depth estimation,” arXiv preprint arXiv:2407.08187 , 2024

  16. [16]

    Repurposing diffusion-based image generators for monoc- ular depth estimation,

    B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monoc- ular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 9492–9502

  17. [17]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

    X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” in European Conference on Computer Vision. Springer, 2025, pp. 241–258

  18. [18]

    Depthfm: Fast monocular depth estimation with flow matching,

    M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V . T. Hu, and B. Ommer, “Depthfm: Fast monocular depth estimation with flow matching,” 2024

  19. [19]

    What matters when repurposing diffusion models for general dense perception tasks?

    G. Xu, Y . Ge, M. Liu, C. Fan, K. Xie, Z. Zhao, H. Chen, and C. Shen, “What matters when repurposing diffusion models for general dense perception tasks?” arXiv preprint arXiv:2403.06090 , 2024

  20. [20]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    J. He, H. Li, W. Yin, Y . Liang, L. Li, K. Zhou, H. Liu, B. Liu, and Y .-C. Chen, “Lotus: Diffusion-based visual foundation model for high-quality dense prediction,” arXiv preprint arXiv:2409.18124 , 2024

  21. [21]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

  22. [22]

    Scaling rectified flow transformers for high-resolution image synthesis,

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel et al. , “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first International Conference on Machine Learning , 2024

  23. [23]

    Deep unsupervised learning using nonequilibrium thermodynamics,

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning . PMLR, 2015, pp. 2256–2265

  24. [24]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Y . Guo, C. Yang, A. Rao, Z. Liang, Y . Wang, Y . Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate your personalized text- to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023

  25. [25]

    Animate anyone: Consistent and controllable image-to-video synthesis for character animation,

    L. Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 8153–8163

  26. [26]

    Smartbrush: Text and shape guided object inpainting with diffusion model,

    S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 428–22 437

  27. [27]

    Hd-painter: high-resolution and prompt-faithful text-guided im- age inpainting with diffusion models,

    H. Manukyan, A. Sargsyan, B. Atanyan, Z. Wang, S. Navasardyan, and H. Shi, “Hd-painter: high-resolution and prompt-faithful text-guided im- age inpainting with diffusion models,” arXiv preprint arXiv:2312.14091, 2023

  28. [28]

    Srdiff: Single image super-resolution with diffusion probabilistic mod- els,

    H. Li, Y . Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y . Chen, “Srdiff: Single image super-resolution with diffusion probabilistic mod- els,” Neurocomputing, vol. 479, pp. 47–59, 2022

  29. [29]

    Exploiting diffusion prior for real-world image super-resolution,

    J. Wang, Z. Yue, S. Zhou, K. C. Chan, and C. C. Loy, “Exploiting diffusion prior for real-world image super-resolution,” International Journal of Computer Vision , pp. 1–21, 2024

  30. [30]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems , vol. 33, pp. 6840– 6851, 2020

  31. [31]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,” arXiv preprint arXiv:2011.13456 , 2020

  32. [32]

    On fast sampling of diffusion probabilistic models,

    Z. Kong and W. Ping, “On fast sampling of diffusion probabilistic models,” arXiv preprint arXiv:2106.00132 , 2021

  33. [33]

    Noise estim ation for generative diffusion models

    R. San-Roman, E. Nachmani, and L. Wolf, “Noise estimation for generative diffusion models,” arXiv preprint arXiv:2104.02600 , 2021

  34. [34]

    Denoising Diffusion Implicit Models

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502 , 2020

  35. [35]

    Cascaded diffusion models for high fidelity image generation,

    J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” Journal of Machine Learning Research , vol. 23, no. 47, pp. 1–33, 2022

  36. [36]

    Score-based generative modeling in latent space,

    A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” Advances in neural information processing systems , vol. 34, pp. 11 287–11 302, 2021

  37. [37]

    LAION-5b: An open large- scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. W. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. R. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5b: An open large- scale dataset for training next generation image-text models,” in Thirty-sixth Conference on Neural Information Processing...

  38. [38]

    A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,

    Y . LeCun, “A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,” Open Review, vol. 62, no. 1, pp. 1–62, 2022

  39. [39]

    Self-supervised learning from images with a joint-embedding predictive architecture,

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rab- bat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 15 619–15 629

  40. [40]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie, “Rep- resentation alignment for generation: Training diffusion transformers is easier than you think,” arXiv preprint arXiv:2410.06940 , 2024

  41. [41]

    Vision meets robotics: The kitti dataset,

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research (IJRR) , 2013

  42. [42]

    A naturalistic open source movie for optical flow evaluation,

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in Proceedings of the European Conference on Computer Vision (ECCV) , ser. Part IV , LNCS 7577, Oct. 2012, pp. 611–625

  43. [43]

    Sun rgb-d: A rgb-d scene under- standing benchmark suite,

    S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene under- standing benchmark suite,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2015, pp. 567–576. 11

  44. [44]

    Indoor segmen- tation and support inference from rgbd images,

    P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor segmen- tation and support inference from rgbd images,” in Proceedings of the European Conference on Computer Vision (ECCV) , 2012

  45. [45]

    Cornet: Context-based ordinal regression network for monocular depth estimation,

    X. Meng, C. Fan, Y . Ming, and H. Yu, “Cornet: Context-based ordinal regression network for monocular depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4841– 4853, 2021

  46. [46]

    Depth map prediction from a single image using a multi-scale deep network,

    D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems , vol. 27, 2014

  47. [47]

    Deeper depth prediction with fully convolutional residual networks,

    I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV) . IEEE, 2016, pp. 239–248

  48. [48]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141

  49. [49]

    Predicting depth, surface normals and se- mantic labels with a common multi-scale convolutional architecture,

    D. Eigen and R. Fergus, “Predicting depth, surface normals and se- mantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, pp. 2650–2658

  50. [50]

    Web stereo video super- vision for depth prediction from dynamic scenes,

    C. Wang, S. Lucey, F. Perazzi, and O. Wang, “Web stereo video super- vision for depth prediction from dynamic scenes,” in 2019 International Conference on 3D Vision (3DV) . IEEE, 2019, pp. 348–357

  51. [51]

    Monocular depth estimation using laplacian pyramid-based depth residuals,

    M. Song, S. Lim, and W. Kim, “Monocular depth estimation using laplacian pyramid-based depth residuals,” IEEE transactions on circuits and systems for video technology , vol. 31, no. 11, pp. 4381–4393, 2021

  52. [52]

    New crfs: Neural window fully-connected crfs for monocular depth estimation,

    W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “New crfs: Neural window fully-connected crfs for monocular depth estimation,” arXiv preprint arXiv:2203.01502, 2022

  53. [53]

    Unleashing text-to-image diffusion models for visual perception,

    W. Zhao, Y . Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 5729–5739

  54. [54]

    Ecodepth: Effective conditioning of diffusion models for monocular depth estimation,

    S. Patni, A. Agarwal, and C. Arora, “Ecodepth: Effective conditioning of diffusion models for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 285–28 295

  55. [55]

    Estimating depth from monocular images as classification using deep fully convolutional residual networks,

    Y . Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 28, no. 11, pp. 3174–3182, 2017

  56. [56]

    Monocular depth estimation with augmented ordinal depth relationships,

    Y . Cao, T. Zhao, K. Xian, C. Shen, Z. Cao, and S. Xu, “Monocular depth estimation with augmented ordinal depth relationships,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 30, no. 8, pp. 2674–2682, 2019

  57. [57]

    Adabins: Depth estimation using adaptive bins,

    S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 4009–4018

  58. [58]

    Binsformer: Revisiting adaptive bins for monocular depth estimation,

    Z. Li, X. Wang, X. Liu, and J. Jiang, “Binsformer: Revisiting adaptive bins for monocular depth estimation,” arXiv preprint arXiv:2204.00987, 2022

  59. [59]

    Ha-bins: Hierarchical adaptive bins for robust monocular depth estimation across multiple datasets,

    R. Zhu, Z. Song, L. Liu, J. He, T. Zhang, and Y . Zhang, “Ha-bins: Hierarchical adaptive bins for robust monocular depth estimation across multiple datasets,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4354–4366, 2024

  60. [60]

    Enforcing geometric constraints of virtual normal for depth prediction,

    W. Yin, Y . Liu, C. Shen, and Y . Yan, “Enforcing geometric constraints of virtual normal for depth prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2019, pp. 5684–5693

  61. [61]

    Ec-depth: Exploring the consistency of self-supervised monocular depth estimation under challenging scenes,

    R. Zhu, Z. Song, C. Wang, J. He, and T. Zhang, “Ec-depth: Exploring the consistency of self-supervised monocular depth estimation under challenging scenes,” arXiv preprint arXiv:2310.08044 , 2023

  62. [62]

    Geonet: Geometric neural network for joint depth and surface normal estimation,

    X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 283–291

  63. [63]

    Plane2depth: Hierarchical adaptive plane guidance for monocular depth estimation,

    L. Liu, R. Zhu, J. Deng, Z. Song, W. Yang, and T. Zhang, “Plane2depth: Hierarchical adaptive plane guidance for monocular depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, 2024

  64. [64]

    Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,

    D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2018, pp. 675–684

  65. [65]

    Towards scene understanding: Unsupervised monocular depth estimation with semantic- aware representation,

    P.-Y . Chen, A. H. Liu, Y .-C. Liu, and Y .-C. F. Wang, “Towards scene understanding: Unsupervised monocular depth estimation with semantic- aware representation,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition , 2019, pp. 2624–2632

  66. [66]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  67. [67]

    Diffusionedge: Diffusion probabilistic model for crisp edge detection

    Y. Ye, K. Xu, Y. Huang, R. Yi, and Z. Cai, “Diffusionedge: Diffusion probabilistic model for crisp edge detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 6675–6683

  68. [68]

    Robust estimation of a location parameter

    P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs in statistics: Methodology and distribution. Springer, 1992, pp. 492–518

  69. [69]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

  70. [70]

    Learning to recover 3d scene shape from a single image

    W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, “Learning to recover 3d scene shape from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 204–213

  71. [71]

    Hierarchical normalization for robust monocular depth estimation

    C. Zhang, W. Yin, B. Wang, G. Yu, B. Fu, and C. Shen, “Hierarchical normalization for robust monocular depth estimation,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 128–14 139, 2022

  72. [72]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 912–10 922

  73. [73]

    Virtual KITTI 2

    Y. Cabon, N. Murray, and M. Humenberger, “Virtual KITTI 2,” arXiv preprint arXiv:2001.10773, 2020

  74. [74]

    Vision meets robotics: The KITTI dataset

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013

  75. [75]

    Indoor segmentation and support inference from RGBD images

    N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760

  76. [76]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839

  77. [77]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3260–3269

  78. [78]

    Diode: A dense indoor and outdoor depth dataset

    I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter et al., “Diode: A dense indoor and outdoor depth dataset,” arXiv preprint arXiv:1908.00463, 2019

  79. [79]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, “Depth Pro: Sharp monocular metric depth in less than a second,” arXiv preprint arXiv:2410.02073, 2024

  80. [80]

    Reproducible scaling laws for contrastive language-image learning

    M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829
