DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
Pith reviewed 2026-05-23 06:05 UTC · model grok-4.3
The pith
Single-step diffusion models can match multi-step accuracy in monocular depth estimation once generative features are aligned with discriminative needs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DepthMaster is a single-step deterministic diffusion model that adapts generative features for the discriminative task of monocular depth estimation. A Feature Alignment module incorporates high-quality semantic features to strengthen the denoising network and reduce overfitting to texture details. A Fourier Enhancement module adaptively balances low-frequency scene structure against high-frequency details. Two-stage training first emphasizes global structure via the alignment module, then refines visual quality via the Fourier module, yielding state-of-the-art generalization and detail preservation that surpasses other diffusion-based depth estimators across datasets.
What carries the argument
Feature Alignment module (injects semantic features into the denoising network) paired with Fourier Enhancement module (balances frequency content), trained in two sequential stages.
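The summary describes the Fourier Enhancement module only in prose. As a rough illustration of what "balancing frequency content" can mean, the sketch below splits a 2D map into low- and high-frequency parts with a fixed radial mask and recombines them with scalar gains. It is not the paper's module: the function names `split_frequencies` and `rebalance`, the ideal (hard) mask, the `cutoff` default, and the fixed gains are all assumptions made for illustration, whereas the paper describes an adaptive balance.

```python
import numpy as np

def split_frequencies(depth, cutoff=0.1):
    """Split a 2D map into low- and high-frequency parts.

    `cutoff` is a radial frequency in cycles per pixel (Nyquist is 0.5).
    The hard radial mask and the default value are illustrative assumptions.
    """
    h, w = depth.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, shape (h, 1)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, shape (1, w)
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    spectrum = np.fft.fft2(depth)
    low = np.fft.ifft2(spectrum * low_mask).real  # coarse scene structure
    high = depth - low                            # residual fine detail
    return low, high

def rebalance(depth, low_gain=1.0, high_gain=1.0, cutoff=0.1):
    """Recombine the two bands with hypothetical fixed scalar gains."""
    low, high = split_frequencies(depth, cutoff)
    return low_gain * low + high_gain * high
```

With both gains at 1.0 the input is returned unchanged (up to floating-point error); raising `high_gain` sharpens edges, while lowering it smooths the map toward its coarse structure.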
If this is right
- A single denoising step becomes sufficient for competitive depth maps, rather than dozens of steps.
- Depth estimates improve on unseen datasets without per-dataset retraining or fine-tuning.
- Fine details in depth maps are recovered without sacrificing global scene consistency.
- The same single-step diffusion backbone can be reused for other dense prediction tasks once the alignment and frequency modules are added.
Where Pith is reading between the lines
- The two-stage training pattern may serve as a general recipe for converting generative diffusion backbones into fast discriminative predictors.
- If the feature mismatch is the dominant issue, similar alignment techniques could improve single-step diffusion for related tasks such as surface normal estimation or semantic segmentation.
- The Fourier module's explicit frequency control suggests a route to diagnose and correct other failure modes where diffusion models lose high-frequency information under aggressive step reduction.
Load-bearing premise
The performance gap in single-step diffusion depth estimation is caused mainly by a mismatch between generative and discriminative features that these two modules can close without introducing overfitting or detail loss.
What would settle it
A controlled test in which removing either the Feature Alignment or Fourier Enhancement module produces no measurable drop in accuracy or detail on held-out datasets would falsify the claim that these modules are the effective fix.
Original abstract
Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DepthMaster, a single-step deterministic diffusion model for monocular depth estimation. It proposes a Feature Alignment module that incorporates high-quality semantic features to mitigate overfitting to texture details from generative features, and a Fourier Enhancement module that adaptively balances low-frequency structure and high-frequency details. A two-stage training strategy is employed (first focusing on global structure via Feature Alignment, then on visual quality via Fourier Enhancement). The central claim is that these components close the generative-discriminative feature gap, yielding state-of-the-art generalization and detail preservation that outperforms other diffusion-based methods across various datasets.
Significance. If the reported experiments hold, the work would be significant for adapting diffusion models to discriminative tasks under single-step inference constraints. The modular design (Feature Alignment + Fourier Enhancement) and staged training provide a concrete, testable approach to the generative-discriminative mismatch that prior single-step methods overlook, with potential impact on efficient depth estimation pipelines.
major comments (1)
- [Abstract] The assertion of 'state-of-the-art performance in terms of generalization and detail preservation' and 'outperforming other diffusion-based methods across various datasets' is presented with no quantitative metrics, baselines, error analysis, dataset names, or table references. This is load-bearing for the central claim and creates a verification gap that must be addressed with explicit results.
minor comments (2)
- The description of the two-stage training strategy would benefit from explicit loss formulations or pseudocode to clarify how the modules are activated or frozen across stages.
- Notation for the Fourier Enhancement module (e.g., frequency-domain operations) should be defined with equations rather than prose alone to ensure reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract. We agree that the central claims require explicit support and will revise the abstract to address the verification gap while preserving conciseness.
Point-by-point responses
- Referee: [Abstract] The assertion of 'state-of-the-art performance in terms of generalization and detail preservation' and 'outperforming other diffusion-based methods across various datasets' is presented with no quantitative metrics, baselines, error analysis, dataset names, or table references. This is load-bearing for the central claim and creates a verification gap that must be addressed with explicit results.
  Authors: We agree that the abstract would be strengthened by including explicit quantitative support. In the revised version we will add concise references to key metrics (e.g., AbsRel, RMSE on NYU Depth V2, KITTI, and ETH3D), the main baselines, and the corresponding result tables/figures. This directly addresses the verification gap without expanding the abstract beyond typical length constraints.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper introduces two new modules (Feature Alignment, Fourier Enhancement) plus a two-stage training procedure as design choices to adapt generative features for discriminative depth estimation. No equations, uniqueness theorems, or first-principles derivations are claimed; performance claims rest on the proposed architecture and reported experiments rather than any reduction to fitted parameters or self-citation chains. The provided abstract and skeptic analysis confirm the argument is internally consistent without the enumerated circularity patterns (self-definitional, fitted-input prediction, load-bearing self-citation, etc.).
Forward citations
Cited by 1 Pith paper
- The Midas Touch for Metric Depth: MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.