DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

Bo Li; Hao Zhang; Li Liu; Peng-Tao Jiang; Ruijie Zhu; Tianzhu Zhang; Zerong Wang; Ziyang Song

arxiv: 2501.02576 · v2 · submitted 2025-01-05 · 💻 cs.CV


Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang

Pith reviewed 2026-05-23 06:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords monocular depth estimation · diffusion models · single-step denoising · feature alignment · Fourier enhancement · two-stage training · generalization


The pith

Single-step diffusion models can match multi-step accuracy in monocular depth estimation once generative features are aligned with discriminative needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the speed-accuracy tradeoff in diffusion-based monocular depth estimation. Standard diffusion needs many denoising steps for good results, but recent single-step versions lose either generalization or fine detail because the features learned for image generation do not match what depth prediction requires. DepthMaster closes that gap with two modules and a two-stage training schedule: one module injects semantic features to stop overfitting to textures, the other uses Fourier transforms to restore missing high-frequency details while keeping global structure. The result is a model that runs at single-step speed yet reports better generalization and detail preservation than prior diffusion methods on multiple datasets.

Core claim

DepthMaster is a single-step deterministic diffusion model that adapts generative features for the discriminative task of monocular depth estimation. A Feature Alignment module incorporates high-quality semantic features to strengthen the denoising network and reduce overfitting to texture details. A Fourier Enhancement module adaptively balances low-frequency scene structure against high-frequency details. Two-stage training first emphasizes global structure via the alignment module, then refines visual quality via the Fourier module, yielding state-of-the-art generalization and detail preservation that surpasses other diffusion-based depth estimators across datasets.

What carries the argument

Feature Alignment module (injects semantic features into the denoising network) paired with Fourier Enhancement module (balances frequency content), trained in two sequential stages.
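One way to picture the Feature Alignment idea is as an auxiliary loss that pulls the denoising network's intermediate features toward those of a frozen semantic encoder. The sketch below is an illustration under assumptions, not the paper's formulation: the cosine-distance form, the names `cosine_similarity` and `alignment_loss`, and the premise that both feature sets already share one dimensionality are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero features
    return dot / (norm_a * norm_b)


def alignment_loss(
    denoiser_feats: Sequence[Vector], semantic_feats: Sequence[Vector]
) -> float:
    """Mean (1 - cosine similarity) over spatial positions.

    Hypothetical loss form: 0 when every pair of features points the same
    way, 2 when every pair points in opposite directions.
    """
    if len(denoiser_feats) != len(semantic_feats):
        raise ValueError("both inputs must cover the same positions")
    if not denoiser_feats:
        raise ValueError("at least one position is required")
    total = sum(
        1.0 - cosine_similarity(f, g)
        for f, g in zip(denoiser_feats, semantic_feats)
    )
    return total / len(denoiser_feats)
```

Because cosine similarity ignores vector magnitude, a loss of this shape would constrain only the direction of the features, which is one plausible reason to prefer it over a plain squared error when the two networks use different feature scales.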

If this is right

A single denoising step becomes sufficient for competitive depth maps, instead of dozens of steps.
Depth estimates improve on unseen datasets without per-dataset retraining or fine-tuning.
Fine details in depth maps are recovered without sacrificing global scene consistency.
The same single-step diffusion backbone can be reused for other dense prediction tasks once the alignment and frequency modules are added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The two-stage training pattern may serve as a general recipe for converting generative diffusion backbones into fast discriminative predictors.
If the feature mismatch is the dominant issue, similar alignment techniques could improve single-step diffusion for related tasks such as surface normal estimation or semantic segmentation.
The Fourier module's explicit frequency control suggests a route to diagnose and correct other failure modes where diffusion models lose high-frequency information under aggressive step reduction.
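To make "explicit frequency control" concrete, the following sketch splits a signal into low- and high-frequency bands with a discrete Fourier transform, scales each band, and transforms back. It is a toy under stated assumptions: it works on a 1D list rather than 2D feature maps, and the fixed `cutoff`, `low_gain`, and `high_gain` parameters are hypothetical stand-ins for whatever adaptive weighting the paper's module learns.

```python
import cmath
from typing import List


def dft(signal: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[complex]:
    """Inverse of dft, including the 1/n normalization."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def rebalance_frequencies(
    signal: List[float], cutoff: int, low_gain: float, high_gain: float
) -> List[float]:
    """Scale bins at or below `cutoff` by low_gain and the rest by high_gain."""
    n = len(signal)
    scaled = []
    for k, coeff in enumerate(dft(signal)):
        # Bin k and bin n - k carry the same frequency for a real signal,
        # so both get the same gain and the output stays real.
        freq = min(k, n - k)
        gain = low_gain if freq <= cutoff else high_gain
        scaled.append(gain * coeff)
    return [value.real for value in idft(scaled)]
```

Setting `high_gain` to zero leaves only the smooth structure, while setting `low_gain` to zero leaves only the fine oscillations; a diagnostic along these lines could show which band a step-reduced model has lost.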

Load-bearing premise

The performance gap in single-step diffusion depth estimation is caused mainly by a mismatch between generative and discriminative features that these two modules can close without introducing overfitting or detail loss.

What would settle it

A controlled test in which removing either the Feature Alignment or Fourier Enhancement module produces no measurable drop in accuracy or detail on held-out datasets would falsify the claim that these modules are the effective fix.

Figures

Figures reproduced from arXiv: 2501.02576 by Bo Li, Hao Zhang, Li Liu, Peng-Tao Jiang, Ruijie Zhu, Tianzhu Zhang, Zerong Wang, Ziyang Song.

**Figure 1.** Visualization of different paradigms. “Denoise” refers to predicting depth in a diffusion-denoising way. Limited by the feature representation capability …

**Figure 2.** The overall framework of DepthMaster. RGB is first projected into the latent space by the I2L Encoder to obtain …

**Figure 3.** Qualitative comparison with zero-shot monocular depth estimation methods across different datasets. Our model demonstrates excellent detail …

**Figure 4.** Qualitative results on in-the-wild examples. Our model not only recovers correct scene structure, but also exhibits fine-grained details.

**Figure 5.** Depth distribution of different depth preprocess methods on Virtual …

**Figure 6.** Visualization of predictions from two stages. With the Fourier …

Original abstract

Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DepthMaster adds Feature Alignment and Fourier Enhancement modules plus two-stage training to single-step diffusion depth estimation, addressing a plausible feature mismatch without obvious logical flaws.


The paper takes recent single-step deterministic diffusion approaches for monocular depth and targets the gap between generative features and discriminative needs. It adds a Feature Alignment module to pull in semantic information and reduce texture overfitting, plus a Fourier Enhancement module to balance low-frequency structure against high-frequency details. Training happens in two stages, first locking in global layout then refining visuals. This is a direct, incremental response to the limitations the authors flag in prior work.

The design choices line up internally with the stated goals and do not rely on circular derivations or unfalsifiable steps. The modules are described clearly enough that someone could implement the intent. What stands out is the focused engineering: instead of a broad new framework, they isolate two concrete problems and propose targeted fixes with staged optimization. That keeps the contribution narrow but potentially practical for speeding up inference while trying to hold quality.

The main soft spot is that the abstract states SOTA generalization and detail preservation across datasets without any metrics, baselines, or error breakdowns. Even with the full text consulted, the strength of the central claim still depends on whether the experiments actually show the modules closing the gap without side effects like detail loss or overfitting. If the paper includes solid ablations and standard benchmark tables, that would make the case much stronger.

This is for CV researchers already working on efficient depth estimators or diffusion adaptations for discriminative tasks. A reader looking for practical tweaks to single-step methods could extract usable ideas. It deserves peer review so the empirical results can be checked properly.

Referee Report

1 major / 2 minor

Summary. The paper introduces DepthMaster, a single-step deterministic diffusion model for monocular depth estimation. It proposes a Feature Alignment module that incorporates high-quality semantic features to mitigate overfitting to texture details from generative features, and a Fourier Enhancement module that adaptively balances low-frequency structure and high-frequency details. A two-stage training strategy is employed (first focusing on global structure via Feature Alignment, then on visual quality via Fourier Enhancement). The central claim is that these components close the generative-discriminative feature gap, yielding state-of-the-art generalization and detail preservation that outperforms other diffusion-based methods across various datasets.

Significance. If the reported experiments hold, the work would be significant for adapting diffusion models to discriminative tasks under single-step inference constraints. The modular design (Feature Alignment + Fourier Enhancement) and staged training provide a concrete, testable approach to the generative-discriminative mismatch that prior single-step methods overlook, with potential impact on efficient depth estimation pipelines.

major comments (1)

[Abstract] The assertion of 'state-of-the-art performance in terms of generalization and detail preservation' and 'outperforming other diffusion-based methods across various datasets' is presented with no quantitative metrics, baselines, error analysis, dataset names, or table references. This is load-bearing for the central claim and creates a verification gap that must be addressed with explicit results.

minor comments (2)

The description of the two-stage training strategy would benefit from explicit loss formulations or pseudocode to clarify how the modules are activated or frozen across stages.
Notation for the Fourier Enhancement module (e.g., frequency-domain operations) should be defined with equations rather than prose alone to ensure reproducibility.
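The kind of pseudocode the first minor comment asks for might look like the sketch below. It encodes only what the abstract states (stage one centers on Feature Alignment, stage two on Fourier Enhancement); everything else is an assumption made for illustration, including the `StageConfig` fields, the choice to switch alignment off in stage two, and the additive form of `stage_loss`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StageConfig:
    name: str
    use_feature_alignment: bool
    use_fourier_enhancement: bool


# Assumption: the abstract does not say whether alignment stays active in
# stage two or which weights are frozen, so this sketch simply turns it off.
STAGES = (
    StageConfig("stage1_global_structure", True, False),
    StageConfig("stage2_visual_quality", False, True),
)


def active_modules(stage: StageConfig) -> List[str]:
    """List the modules enabled in a given stage."""
    modules = []
    if stage.use_feature_alignment:
        modules.append("feature_alignment")
    if stage.use_fourier_enhancement:
        modules.append("fourier_enhancement")
    return modules


def training_plan(stages: Tuple[StageConfig, ...]) -> List[Tuple[str, List[str]]]:
    """Return (stage name, enabled modules) pairs in training order."""
    return [(stage.name, active_modules(stage)) for stage in stages]


def stage_loss(
    stage: StageConfig,
    depth_loss: float,
    alignment_loss: float,
    alignment_weight: float = 1.0,
) -> float:
    """Hypothetical objective: add the alignment term only when it is enabled."""
    if stage.use_feature_alignment:
        return depth_loss + alignment_weight * alignment_loss
    return depth_loss
```

Writing the schedule as data makes the open questions visible: a reader can see at a glance which module is on in which stage, and the authors' actual freezing policy would replace the flagged assumption.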

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We agree that the central claims require explicit support and will revise the abstract to address the verification gap while preserving conciseness.


Referee: [Abstract] The assertion of 'state-of-the-art performance in terms of generalization and detail preservation' and 'outperforming other diffusion-based methods across various datasets' is presented with no quantitative metrics, baselines, error analysis, dataset names, or table references. This is load-bearing for the central claim and creates a verification gap that must be addressed with explicit results.

Authors: We agree that the abstract would be strengthened by including explicit quantitative support. In the revised version we will add concise references to key metrics (e.g., AbsRel, RMSE on NYU Depth V2, KITTI, and ETH3D), the main baselines, and the corresponding result tables/figures. This directly addresses the verification gap without expanding the abstract beyond typical length constraints.

Revision: yes
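For readers unfamiliar with the two metrics named in the response, the sketch below computes them from their standard definitions: AbsRel is the mean of |prediction - ground truth| / ground truth, and RMSE is the square root of the mean squared error. Skipping pixels whose ground-truth depth is not positive is a common convention assumed here, and the function names are illustrative. Any scale-and-shift alignment that an evaluation protocol applies before scoring is deliberately left out.

```python
import math
from typing import List, Sequence, Tuple


def _valid_pairs(
    pred: Sequence[float], target: Sequence[float]
) -> List[Tuple[float, float]]:
    """Keep only pixels with positive ground-truth depth (assumed convention)."""
    if len(pred) != len(target):
        raise ValueError("prediction and ground truth must have the same length")
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0.0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return pairs


def abs_rel(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute relative error over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))
```

AbsRel is unitless and weights errors on near surfaces more heavily, whereas RMSE is in depth units and is dominated by large absolute errors, so the two are usually reported together.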

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces two new modules (Feature Alignment, Fourier Enhancement) plus a two-stage training procedure as design choices to adapt generative features for discriminative depth estimation. No equations, uniqueness theorems, or first-principles derivations are claimed; performance claims rest on the proposed architecture and reported experiments rather than any reduction to fitted parameters or self-citation chains. The provided abstract and skeptic analysis confirm the argument is internally consistent without the enumerated circularity patterns (self-definitional, fitted-input prediction, load-bearing self-citation, etc.).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, datasets, or implementation details; no free parameters, axioms, or invented entities can be identified or audited.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Midas Touch for Metric Depth
cs.CV · 2026-05 · unverdicted · novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

Mgnet: Monocular geo- metric scene understanding for autonomous driving,

M. Sch ¨on, M. Buchholz, and K. Dietmayer, “Mgnet: Monocular geo- metric scene understanding for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2021, pp. 15 804–15 815

work page 2021
[2]

Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,

Y . Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453

work page 2019
[3]

Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving,

Y . You, Y . Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariha- ran, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving,” arXiv preprint arXiv:1906.06310, 2019

work page arXiv 1906
[4]

Ro- bodepth: Robust out-of-distribution depth estimation under corruptions,

L. Kong, S. Xie, H. Hu, B. Cottereau, L. X. Ng, and W. T. Ooi, “Ro- bodepth: Robust out-of-distribution depth estimation under corruptions,” arXiv preprint arXiv:23xx.xxxxx , 2023

work page 2023
[5]

Consistent video depth estimation,

X. Luo, J.-B. Huang, R. Szeliski, K. Matzen, and J. Kopf, “Consistent video depth estimation,” ACM Transactions on Graphics (ToG), vol. 39, no. 4, pp. 71–1, 2020

work page 2020
[6]

Low power depth estimation of rigid objects for time-of-flight imaging,

J. Noraky and V . Sze, “Low power depth estimation of rigid objects for time-of-flight imaging,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1524–1534, 2019

work page 2019
[7]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 3836–3847

work page 2023
[8]

Structure and content-guided video synthesis with diffusion models,

P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356

work page 2023
[9]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 371–10 381

work page 2024
[10]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2021, pp. 12 179–12 188

work page 2021
[11]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M ¨uller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,

A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 786–10 796

work page 2021
[13]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,

R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 3, pp. 1623–1637, 2020

work page 2020
[14]

Diversedepth: Affine-invariant depth prediction using diverse data,

W. Yin, X. Wang, C. Shen, Y . Liu, Z. Tian, S. Xu, C. Sun, and D. Renyin, “Diversedepth: Affine-invariant depth prediction using diverse data,” arXiv preprint arXiv:2002.00569 , 2020

work page arXiv 2002
[15]

Scaledepth: Decomposing metric depth estimation into scale prediction and relative depth estimation,

R. Zhu, C. Wang, Z. Song, L. Liu, T. Zhang, and Y . Zhang, “Scaledepth: Decomposing metric depth estimation into scale prediction and relative depth estimation,” arXiv preprint arXiv:2407.08187 , 2024

work page arXiv 2024
[16]

Repurposing diffusion-based image generators for monoc- ular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monoc- ular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 9492–9502

work page 2024
[17]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” in European Conference on Computer Vision. Springer, 2025, pp. 241–258

work page 2025
[18]

Depthfm: Fast monocular depth estimation with flow matching,

M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V . T. Hu, and B. Ommer, “Depthfm: Fast monocular depth estimation with flow matching,” 2024

work page 2024
[19]

What matters when repurposing diffusion models for general dense perception tasks?

G. Xu, Y . Ge, M. Liu, C. Fan, K. Xie, Z. Zhao, H. Chen, and C. Shen, “What matters when repurposing diffusion models for general dense perception tasks?” arXiv preprint arXiv:2403.06090 , 2024

work page arXiv 2024
[20]

Lotus: Diffusion-based visual foundation model for high-quality dense prediction

J. He, H. Li, W. Yin, Y . Liang, L. Li, K. Zhou, H. Liu, B. Liu, and Y .-C. Chen, “Lotus: Diffusion-based visual foundation model for high-quality dense prediction,” arXiv preprint arXiv:2409.18124 , 2024

work page arXiv 2024
[21]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695

work page 2022
[22]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel et al. , “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first International Conference on Machine Learning , 2024

work page 2024
[23]

Deep unsupervised learning using nonequilibrium thermodynamics,

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning . PMLR, 2015, pp. 2256–2265

work page 2015
[24]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Y . Guo, C. Yang, A. Rao, Z. Liang, Y . Wang, Y . Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate your personalized text- to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation,

L. Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 8153–8163

work page 2024
[26]

Smartbrush: Text and shape guided object inpainting with diffusion model,

S. Xie, Z. Zhang, Z. Lin, T. Hinz, and K. Zhang, “Smartbrush: Text and shape guided object inpainting with diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 428–22 437

work page 2023
[27]

Hd-painter: high-resolution and prompt-faithful text-guided im- age inpainting with diffusion models,

H. Manukyan, A. Sargsyan, B. Atanyan, Z. Wang, S. Navasardyan, and H. Shi, “Hd-painter: high-resolution and prompt-faithful text-guided im- age inpainting with diffusion models,” arXiv preprint arXiv:2312.14091, 2023

work page arXiv 2023
[28]

Srdiff: Single image super-resolution with diffusion probabilistic mod- els,

H. Li, Y . Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y . Chen, “Srdiff: Single image super-resolution with diffusion probabilistic mod- els,” Neurocomputing, vol. 479, pp. 47–59, 2022

work page 2022
[29]

Exploiting diffusion prior for real-world image super-resolution,

J. Wang, Z. Yue, S. Zhou, K. C. Chan, and C. C. Loy, “Exploiting diffusion prior for real-world image super-resolution,” International Journal of Computer Vision , pp. 1–21, 2024

work page 2024
[30]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems , vol. 33, pp. 6840– 6851, 2020

work page 2020
[31]

Score-Based Generative Modeling through Stochastic Differential Equations

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differ- ential equations,” arXiv preprint arXiv:2011.13456 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2011
[32]

On fast sampling of diffusion probabilistic models,

Z. Kong and W. Ping, “On fast sampling of diffusion probabilistic models,” arXiv preprint arXiv:2106.00132 , 2021

work page arXiv 2021
[33]

Noise estim ation for generative diﬀusion models

R. San-Roman, E. Nachmani, and L. Wolf, “Noise estimation for generative diffusion models,” arXiv preprint arXiv:2104.02600 , 2021

work page arXiv 2021
[34]

Denoising Diffusion Implicit Models

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502 , 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[35]

Cascaded diffusion models for high fidelity image generation,

J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” Journal of Machine Learning Research , vol. 23, no. 47, pp. 1–33, 2022

work page 2022
[36]

Score-based generative modeling in latent space,

A. Vahdat, K. Kreis, and J. Kautz, “Score-based generative modeling in latent space,” Advances in neural information processing systems , vol. 34, pp. 11 287–11 302, 2021

work page 2021
[37]

LAION-5b: An open large- scale dataset for training next generation image-text models,

C. Schuhmann, R. Beaumont, R. Vencu, C. W. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. R. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “LAION-5b: An open large- scale dataset for training next generation image-text models,” in Thirty-sixth Conference on Neural Information Processing...

work page 2022
[38]

A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,

Y . LeCun, “A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,” Open Review, vol. 62, no. 1, pp. 1–62, 2022

work page 2022
[39]

Self-supervised learning from images with a joint-embedding predictive architecture,

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rab- bat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 15 619–15 629

work page 2023
[40]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie, “Rep- resentation alignment for generation: Training diffusion transformers is easier than you think,” arXiv preprint arXiv:2410.06940 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research (IJRR) , 2013

work page 2013
[42]

A naturalistic open source movie for optical flow evaluation,

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in Proceedings of the European Conference on Computer Vision (ECCV) , ser. Part IV , LNCS 7577, Oct. 2012, pp. 611–625

work page 2012
[43]

Sun rgb-d: A rgb-d scene under- standing benchmark suite,

S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene under- standing benchmark suite,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2015, pp. 567–576. 11

work page 2015
[44]

Indoor segmen- tation and support inference from rgbd images,

P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor segmen- tation and support inference from rgbd images,” in Proceedings of the European Conference on Computer Vision (ECCV) , 2012

work page 2012
[45]

Cornet: Context-based ordinal regression network for monocular depth estimation,

X. Meng, C. Fan, Y . Ming, and H. Yu, “Cornet: Context-based ordinal regression network for monocular depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4841– 4853, 2021

work page 2021
[46]

Depth map prediction from a single image using a multi-scale deep network,

D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems , vol. 27, 2014

work page 2014
[47]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV) . IEEE, 2016, pp. 239–248

work page 2016
[48]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141

work page 2018
[49]

Predicting depth, surface normals and se- mantic labels with a common multi-scale convolutional architecture,

D. Eigen and R. Fergus, “Predicting depth, surface normals and se- mantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015, pp. 2650–2658

work page 2015
[50]

Web stereo video super- vision for depth prediction from dynamic scenes,

C. Wang, S. Lucey, F. Perazzi, and O. Wang, “Web stereo video super- vision for depth prediction from dynamic scenes,” in 2019 International Conference on 3D Vision (3DV) . IEEE, 2019, pp. 348–357

work page 2019
[51]

Monocular depth estimation using laplacian pyramid-based depth residuals,

M. Song, S. Lim, and W. Kim, “Monocular depth estimation using laplacian pyramid-based depth residuals,” IEEE transactions on circuits and systems for video technology , vol. 31, no. 11, pp. 4381–4393, 2021

work page 2021
[52]

New crfs: Neural window fully-connected crfs for monocular depth estimation,

W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “New crfs: Neural window fully-connected crfs for monocular depth estimation,” arXiv preprint arXiv:2203.01502, 2022

work page arXiv 2022
[53]

Unleashing text-to-image diffusion models for visual perception,

W. Zhao, Y . Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 5729–5739

work page 2023
[54]

Ecodepth: Effective conditioning of diffusion models for monocular depth estimation,

S. Patni, A. Agarwal, and C. Arora, “Ecodepth: Effective conditioning of diffusion models for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 285–28 295

work page 2024
[55]

Estimating depth from monocular images as classification using deep fully convolutional residual networks,

Y . Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 28, no. 11, pp. 3174–3182, 2017

work page 2017
[56]

Monocular depth estimation with augmented ordinal depth relationships,

Y . Cao, T. Zhao, K. Xian, C. Shen, Z. Cao, and S. Xu, “Monocular depth estimation with augmented ordinal depth relationships,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 30, no. 8, pp. 2674–2682, 2019

work page 2019
[57]

Adabins: Depth estimation using adaptive bins,

S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 4009–4018

work page 2021
[58]

Binsformer: Revisiting adaptive bins for monocular depth estimation,

Z. Li, X. Wang, X. Liu, and J. Jiang, “Binsformer: Revisiting adaptive bins for monocular depth estimation,” arXiv preprint arXiv:2204.00987, 2022

[59]

Ha-bins: Hierarchical adaptive bins for robust monocular depth estimation across multiple datasets,

R. Zhu, Z. Song, L. Liu, J. He, T. Zhang, and Y. Zhang, “Ha-bins: Hierarchical adaptive bins for robust monocular depth estimation across multiple datasets,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4354–4366, 2024

[60]

Enforcing geometric constraints of virtual normal for depth prediction,

W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric constraints of virtual normal for depth prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5684–5693

[61]

Ec-depth: Exploring the consistency of self-supervised monocular depth estimation under challenging scenes,

R. Zhu, Z. Song, C. Wang, J. He, and T. Zhang, “Ec-depth: Exploring the consistency of self-supervised monocular depth estimation under challenging scenes,” arXiv preprint arXiv:2310.08044, 2023

[62]

Geonet: Geometric neural network for joint depth and surface normal estimation,

X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 283–291

[63]

Plane2depth: Hierarchical adaptive plane guidance for monocular depth estimation,

L. Liu, R. Zhu, J. Deng, Z. Song, W. Yang, and T. Zhang, “Plane2depth: Hierarchical adaptive plane guidance for monocular depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, 2024

[64]

Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,

D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 675–684

[65]

Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation,

P.-Y. Chen, A. H. Liu, Y.-C. Liu, and Y.-C. F. Wang, “Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2624–2632

[66]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

[67]

Diffusionedge: Diffusion probabilistic model for crisp edge detection,

Y. Ye, K. Xu, Y. Huang, R. Yi, and Z. Cai, “Diffusionedge: Diffusion probabilistic model for crisp edge detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 6675–6683

[68]

Robust estimation of a location parameter,

P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs in statistics: Methodology and distribution. Springer, 1992, pp. 492–518

[69]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

[70]

Learning to recover 3d scene shape from a single image,

W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, “Learning to recover 3d scene shape from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 204–213

[71]

Hierarchical normalization for robust monocular depth estimation,

C. Zhang, W. Yin, B. Wang, G. Yu, B. Fu, and C. Shen, “Hierarchical normalization for robust monocular depth estimation,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 128–14 139, 2022

[72]

Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,

M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 912–10 922

[73]

Virtual KITTI 2

Y . Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020

[74]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013

[75]

Indoor segmentation and support inference from rgbd images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760

[76]

Scannet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

[77]

A multi-view stereo benchmark with high-resolution images and multi-camera videos,

T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3260–3269

[78]

Diode: A dense indoor and outdoor depth dataset,

I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter et al., “Diode: A dense indoor and outdoor depth dataset,” arXiv preprint arXiv:1908.00463, 2019

[79]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun, “Depth pro: Sharp monocular metric depth in less than a second,” arXiv preprint arXiv:2410.02073, 2024

[80]

Reproducible scaling laws for contrastive language-image learning,

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829


Showing first 80 references.
