pith. machine review for the scientific record.

arxiv: 2604.07664 · v1 · submitted 2026-04-09 · 💻 cs.CV · eess.IV

Recognition: unknown

Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords monocular depth estimation · feature restoration · diffusion process · invertible transforms · depth prediction · encoder features · KITTI dataset · auxiliary viewpoint

The pith

Treating encoder features as degraded and restoring them via invertible diffusion produces more accurate monocular depth estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates monocular depth estimation and finds that the encoder-decoder setup leaves room for improvement if the encoder features can be enhanced. It reframes the task as restoring these features, assumed to be degraded versions of ideal ones that would generate the ground-truth depth. An invertible diffusion module is introduced to iteratively restore the features using only indirect supervision from the output depth map, preventing step-to-step drift through a bi-Lipschitz transform. This leads to better depth predictions, as shown by gains on multiple datasets. The approach also includes an optional module to boost low-level details from an auxiliary viewpoint when available.
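
To make the formulation concrete, here is a minimal sketch of a training step under indirect supervision, assuming a PyTorch-style setup with hypothetical module names (`restore_step`, `decoder`) and a generic L1 depth loss standing in for whatever loss the paper actually uses; the real IID-RDepth architecture may differ.

```python
import torch

def train_step(image, gt_depth, encoder, restore_step, decoder, optimizer, num_steps=4):
    """One hypothetical training step under indirect supervision: the
    restoration module and decoder are trained only through the depth loss."""
    with torch.no_grad():                      # pretrained encoder stays frozen
        feats = encoder(image)                 # treated as "degraded" features
    restored = feats
    for t in range(num_steps):                 # iterative diffusion-style restoration
        restored = restore_step(restored, t)   # no direct feature-level target exists
    pred_depth = decoder(restored)             # invertible (bi-Lipschitz) decoder
    valid = gt_depth > 0                       # sparse ground truth (e.g. projected LiDAR)
    loss = torch.abs(pred_depth[valid] - gt_depth[valid]).mean()
    optimizer.zero_grad()
    loss.backward()                            # gradients reach the features only via the depth map
    optimizer.step()
    return loss.item()
```

The point of the sketch is the supervision path: the only learning signal flows backward from the depth map, which is exactly why step-to-step feature drift becomes a concern.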

Core claim

By viewing pretrained encoder features in monocular depth estimation as degraded inputs and applying an Invertible Transform-enhanced Indirect Diffusion module to restore them, the method achieves higher accuracy in predicted depth maps. The bi-Lipschitz condition in the decoder ensures stability during the diffusion iterations despite lacking direct feature supervision. Adding the Auxiliary Viewpoint-based Low-level Feature Enhancement module further refines local details. This results in superior performance compared to prior methods on standard benchmarks like KITTI.
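
For reference, a map f satisfies the bi-Lipschitz condition when there are constants 0 < m ≤ M such that for all inputs x₁, x₂:

```latex
m \,\lVert x_1 - x_2 \rVert \;\le\; \lVert f(x_1) - f(x_2) \rVert \;\le\; M \,\lVert x_1 - x_2 \rVert .
```

The upper bound keeps small feature perturbations from blowing up across diffusion steps, and the lower bound makes f injective, so distinct restored features cannot collapse onto the same output; this is, roughly, the stability property the paper attributes to its invertible decoder.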

What carries the argument

The InvT-IndDiffusion module that performs diffusion-based restoration of encoder features under indirect supervision from the depth map, stabilized by an invertible transform satisfying the bi-Lipschitz condition.

If this is right

  • The proposed method outperforms state-of-the-art approaches on various datasets.
  • On the KITTI benchmark, RMSE improves by 4.09 percent and 37.77 percent over the baseline under two different training settings (read as relative RMSE reductions; see the formula after this list).
  • The AV-LFE module enhances local details in a plug-and-play manner when an auxiliary viewpoint is available.
  • Substantial potential exists in current encoder-decoder architectures once encoder features are properly restored.
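
Reading those percentages as relative RMSE reductions (the usual convention, assumed here), the improvement is

```latex
\Delta_{\mathrm{RMSE}} \;=\; \frac{\mathrm{RMSE}_{\mathrm{baseline}} - \mathrm{RMSE}_{\mathrm{ours}}}{\mathrm{RMSE}_{\mathrm{baseline}}} \times 100\% ,
```

so on a hypothetical baseline RMSE of 2.20 m, a 4.09 percent improvement would correspond to roughly 2.11 m and a 37.77 percent improvement to roughly 1.37 m.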

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar restoration techniques could be tested on related tasks such as surface normal estimation or optical flow to check transferability.
  • The stability provided by the invertible transform suggests that diffusion processes in vision tasks benefit from structure-preserving decoders.
  • If the indirect supervision works well here, it may allow training depth models with less direct supervision in other settings.

Load-bearing premise

That the features from a pretrained encoder are degraded versions of some ground-truth feature whose corresponding depth matches the actual ground truth, allowing the diffusion to be guided effectively by the final depth output alone.

What would settle it

Running the baseline encoder-decoder without the InvT-IndDiffusion module and observing whether the RMSE on KITTI remains the same or worsens compared to the full method.

Figures

Figures reproduced from arXiv: 2604.07664 by Chong Lv, Haipeng Ping, Hanxiao Zhai, Huibin Bai, Shuai Li, Wei Hua, Xingyu Gao, Yanbo Gao, Yibo Wang.

Figure 1: Illustration of the feature deviation problem in the diffusion process due to the indirect supervision and iterative inference. It can be mitigated with …
Figure 2: Effects of different levels of features on the final depth prediction.
Figure 3: Overview of the proposed IID-RDepth. The input image is first processed through a frozen encoder to extract the multi-level features. Then the …
Figure 4: Qualitative results on the KITTI dataset comparing our IID-RDepth with existing methods. IID-RDepth utilizes the proposed InvT-IndDiffusion, while …
Figure 5: Visual illustration of the predicted depth maps and their differences when using the InvT-IndDiffusion with different inference steps, where the blue …
Figure 6: Qualitative results of ablation studies on the AV-LFE module in the KITTI [48] dataset. A different pseudo color scheme, changing the colormap …
read the original abstract

Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets. Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at https://github.com/whitehb1/IID-RDepth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reframes monocular depth estimation as restoring degraded pretrained encoder features to an assumed ground-truth feature via an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module that uses only indirect supervision from the final sparse depth map, augmented by a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement (AV-LFE) module; it reports state-of-the-art results with RMSE gains of 4.09% and 37.77% on KITTI under different training settings.

Significance. If the reported gains are shown to arise specifically from controlled feature restoration rather than training differences or the auxiliary module, the work would meaningfully advance MDE by demonstrating that diffusion-based restoration can improve encoder features in standard architectures and by supplying a practical enhancement module.

major comments (2)
  1. [Method (InvT-IndDiffusion module)] Method section on InvT-IndDiffusion: the claim that the invertible transform decoder under the bi-Lipschitz condition reliably prevents step-to-step feature deviations is supported neither by a proof that the learned decoder satisfies the condition nor by any empirical measurement of feature trajectory stability. Given the many-to-one feature-to-depth mapping and purely indirect supervision from the sparse depth loss, gradients may steer the restored features toward any loss-reducing configuration rather than the intended ground-truth restoration.
  2. [Experiments] Experiments: the abstract reports 4.09% and 37.77% RMSE improvements on KITTI versus the baseline, yet no ablation studies isolate the contribution of InvT-IndDiffusion from AV-LFE, no error bars or multi-run statistics are supplied, and baseline details are insufficient to rule out unstated training differences as the source of the gains.
minor comments (2)
  1. [Abstract] Abstract: the statement that the method achieves 'better performance than the state-of-the-art methods on various datasets' should name the specific competing methods and list all metrics used, not only RMSE.
  2. [Method] Notation: the diffusion noise schedule, step count, and invertible transform parameters are listed as free parameters but receive no explicit discussion of how they are chosen or ablated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional analysis and experiments as outlined.

read point-by-point responses
  1. Referee: [Method (InvT-IndDiffusion module)] Method section on InvT-IndDiffusion: the claim that the invertible transform decoder under the bi-Lipschitz condition reliably prevents step-to-step feature deviations is supported neither by a proof that the learned decoder satisfies the condition nor by any empirical measurement of feature trajectory stability. Given the many-to-one feature-to-depth mapping and purely indirect supervision from the sparse depth loss, gradients may steer the restored features toward any loss-reducing configuration rather than the intended ground-truth restoration.

    Authors: We appreciate the referee highlighting this gap. The current manuscript does not contain a formal proof that the learned decoder satisfies the bi-Lipschitz condition, nor does it report empirical measurements of feature trajectory stability across diffusion steps. The InvT-IndDiffusion module employs an invertible transform decoder whose design draws on the information-preserving properties of bi-Lipschitz mappings commonly used in flow-based models to constrain deviations during the iterative restoration process. To address the concern directly, we will add a new subsection with empirical measurements (e.g., L2 distances between consecutive restored features) comparing trajectories with and without the invertible decoder. Regarding the many-to-one mapping and indirect supervision, the diffusion is initialized from the pretrained encoder features and guided iteratively by the depth loss; the invertible decoder enforces reversibility, which we argue limits arbitrary drift toward depth-reducing configurations. We will expand the method description to clarify this inductive bias and acknowledge the limitations of purely indirect supervision. revision: yes

  2. Referee: [Experiments] Experiments: the abstract reports 4.09% and 37.77% RMSE improvements on KITTI versus the baseline, yet no ablation studies isolate the contribution of InvT-IndDiffusion from AV-LFE, no error bars or multi-run statistics are supplied, and baseline details are insufficient to rule out unstated training differences as the source of the gains.

    Authors: We agree that the experimental validation needs to be strengthened to isolate component contributions and demonstrate statistical reliability. In the revised version we will add ablation tables that evaluate the full model, the model without InvT-IndDiffusion, the model without AV-LFE, and the baseline, all under identical training protocols. We will also perform at least three independent training runs for the main results and report mean RMSE with standard deviation (error bars) on KITTI. Finally, we will expand the experimental setup section with complete baseline implementation details, including exact hyper-parameters, optimizer settings, data augmentation, and training schedules, to enable direct reproduction and exclude unstated training differences as the source of the reported gains. revision: yes
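
A hedged sketch of the stability measurement promised in the first response above, assuming a generic `restore_step(features, t)` callable (a hypothetical name, not the paper's API); it simply records the L2 distance between features at consecutive diffusion steps.

```python
import torch

def trajectory_drift(restore_step, init_feat, num_steps=4):
    """Step-to-step feature deviation: the L2 distance between consecutive
    restored features over the iterative diffusion procedure."""
    distances = []
    feat = init_feat
    with torch.no_grad():
        for t in range(num_steps):
            next_feat = restore_step(feat, t)
            distances.append(torch.norm(next_feat - feat).item())
            feat = next_feat
    return distances  # large or growing values would indicate drift

# Hypothetical usage: compare trajectories with and without the invertible decoder.
# drift_invt  = trajectory_drift(invt_restore_step, encoder_feat)
# drift_plain = trajectory_drift(plain_restore_step, encoder_feat)
```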
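
For the second response, a minimal sketch of the promised multi-run aggregation; the configuration names are placeholders and the run lists are left empty to be filled with measured KITTI RMSE values.

```python
import statistics

def summarize_runs(rmse_values):
    """Mean and standard deviation of RMSE over independent training runs."""
    mean = statistics.mean(rmse_values)
    std = statistics.stdev(rmse_values) if len(rmse_values) > 1 else 0.0
    return mean, std

# Rows of the promised ablation table; each list should hold measured
# per-run RMSE values (at least three runs per configuration).
ablation_runs = {
    "baseline": [],
    "baseline + InvT-IndDiffusion": [],
    "baseline + AV-LFE": [],
    "full model (InvT-IndDiffusion + AV-LFE)": [],
}
for config, runs in ablation_runs.items():
    if runs:
        mean, std = summarize_runs(runs)
        print(f"{config}: RMSE = {mean:.3f} ± {std:.3f}")
```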

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper formulates MDE as feature restoration by assumption, introduces InvT-IndDiffusion and AV-LFE modules, and trains them end-to-end using indirect depth-map supervision. Reported gains (4.09% and 37.77% RMSE on KITTI) are empirical benchmark results, not identities obtained by substituting fitted parameters or self-citations back into the output. No equation reduces the final depth to a hyperparameter by construction, no load-bearing self-citation chain exists, and the bi-Lipschitz claim is an architectural assertion rather than a tautology. The method therefore remains externally falsifiable on standard datasets.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on two domain assumptions and two new modules whose parameters are learned rather than derived.

free parameters (2)
  • Diffusion noise schedule and step count
    Standard diffusion hyperparameters that must be chosen or tuned for the feature-restoration task (a generic example follows this ledger).
  • Invertible transform parameters
    Learned weights required to satisfy the bi-Lipschitz condition in the decoder.
axioms (2)
  • domain assumption Pretrained encoder features are degraded versions of an ideal ground-truth feature that would produce the ground-truth depth map
    Invoked in the second paragraph of the abstract to justify the restoration formulation.
  • ad hoc to paper Indirect supervision from the final depth map is adequate to train the diffusion restoration despite the absence of direct feature-level supervision
    Stated explicitly when describing the iterative diffusion procedure.
invented entities (2)
  • InvT-IndDiffusion module no independent evidence
    purpose: Perform feature restoration via diffusion while preventing step-wise drift through an invertible decoder
    New module introduced to solve the feature-deviation problem under indirect supervision.
  • AV-LFE module no independent evidence
    purpose: Enhance low-level local details using an auxiliary viewpoint when available
    Plug-and-play component added for detail recovery.
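
As generic illustrations of the two free parameters listed above: a DDPM-style linear noise schedule [17] and a NICE-style additive coupling layer [49]. Whether IID-RDepth uses these exact forms is not stated in the material here, so both are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Generic DDPM-style linear noise schedule [17]; the endpoints and
    num_steps are exactly the kind of free parameters the ledger lists."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_cumprod

class AdditiveCoupling(nn.Module):
    """Generic NICE-style additive coupling layer [49]: invertible by
    construction, with learned weights of the kind the second free parameter
    refers to. Not necessarily the transform used in the paper."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim // 2), nn.ReLU(),
                                 nn.Linear(dim // 2, dim // 2))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.net(x1)], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y1, y2 - self.net(y1)], dim=-1)
```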

pith-pipeline@v0.9.0 · 5623 in / 1536 out tokens · 71157 ms · 2026-05-10T17:24:41.620130+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1] G. Wu, H. Liu, L. Wang, K. Li, Y. Guo, and Z. Chen, “Self-supervised multi-frame monocular depth estimation for dynamic scenes,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4989–5001, 2024.
  2. [2] S. Shao, R. Li, Z. Pei, Z. Liu, W. Chen, W. Zhu, X. Wu, and B. Zhang, “Towards comprehensive monocular depth estimation: Multiple heads are better than one,” IEEE Transactions on Multimedia, vol. 25, pp. 7660–7671, 2023.
  3. [3] C. Zhang, C. Lin, K. Liao, L. Nie, and Y. Zhao, “As-deformable-as-possible single-image-based view synthesis without depth prior,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 3989–4001, 2023.
  4. [4] J. Jin, J. Liang, Y. Zhao, C. Lin, C. Yao, and A. Wang, “A depth-bin-based graphical model for fast view synthesis distortion estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 6, pp. 1754–1766, 2019.
  5. [5] C. Ling, X. Zhang, and H. Chen, “Unsupervised monocular depth estimation using attention and multi-warp reconstruction,” IEEE Transactions on Multimedia, vol. 24, pp. 2938–2949, 2021.
  6. [6] X. Wang, W. Kong, Q. Zhang, Y. Yang, T. Zhao, and J. Jiang, “Distortion-aware self-supervised indoor 360° depth estimation via hybrid projection fusion and structural regularities,” IEEE Transactions on Multimedia, vol. 26, pp. 3998–4011, 2024.
  7. [7] X. Yang, Y. Gao, H. Luo, C. Liao, and K.-T. Cheng, “Bayesian denet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty,” IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2701–2713, 2019.
  8. [8] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9492–9502.
  9. [9] Z. Shen, C. Lin, L. Nie, K. Liao, and Y. Zhao, “Neural contourlet network for monocular 360° depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8574–8585, 2022.
  10. [10] R. Zhu, Z. Song, L. Liu, J. He, T. Zhang, and Y. Zhang, “Ha-bins: Hierarchical adaptive bins for robust monocular depth estimation across multiple datasets,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4354–4366, 2024.
  11. [11] S. Li, H. Bai, Y. Gao, C. Lv, H. Yuan, C. Li, W. Hua, and T. Xie, “Liftformer: Lifting and frame theory based monocular depth estimation using depth and edge oriented subspace representation,” IEEE Transactions on Multimedia, 2025.
  12. [12] Y. Cao, T. Zhao, K. Xian, C. Shen, Z. Cao, and S. Xu, “Monocular depth estimation with augmented ordinal depth relationships,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 8, pp. 2674–2682, 2020.
  13. [13] C. Wei, M. Yang, L. He, and N. Zheng, “Fs-depth: Focal-and-scale depth estimation from a single image in unseen indoor scene,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 10604–10617, 2024.
  14. [14] S. Guo, J. Hu, K. Zhou, J. Wang, L. Song, R. Xie, and W. Zhang, “Real-time free viewpoint video synthesis system based on dibr and a depth estimation network,” IEEE Transactions on Multimedia, pp. 1–16, 2024.
  15. [15] J. Wu, R. Ji, Q. Wang, S. Zhang, X. Sun, Y. Wang, M. Xu, and F. Huang, “Fast monocular depth estimation via side prediction aggregation with continuous spatial refinement,” IEEE Transactions on Multimedia, vol. 25, pp. 1204–1216, 2023.
  16. [16] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
  17. [17] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  18. [18] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 11461–11471.
  19. [19] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  20. [20] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1206–1217.
  21. [21] D. Peng, P. Hu, Q. Ke, and J. Liu, “Diffusion-based image translation with label guidance for domain adaptive semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 808–820.
  22. [22] A. Toker, M. Eisenberger, D. Cremers, and L. Leal-Taixé, “Satsynth: Augmenting image-mask pairs through diffusion models for aerial semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27695–27705.
  23. [23] S. Shang, Z. Shan, G. Liu, L. Wang, X. Wang, Z. Zhang, and J. Zhang, “Resdiff: Combining cnn and diffusion model for image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 8, 2024, pp. 8975–8983.
  24. [24] Y. Wang, W. Yang, X. Chen, Y. Wang, L. Guo, L.-P. Chau, Z. Liu, Y. Qiao, A. C. Kot, and B. Wen, “Sinsr: Diffusion-based image super-resolution in a single step,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 25796–25805.
  25. [25] H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Srdiff: Single image super-resolution with diffusion probabilistic models,” Neurocomputing, vol. 479, pp. 47–59, 2022.
  26. [26] J.-H. Wu, F.-J. Tsai, Y.-T. Peng, C.-C. Tsai, C.-W. Lin, and Y.-Y. Lin, “Id-blau: Image deblurring by implicit diffusion-based reblurring augmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 25847–25856.
  27. [27] M. Ren, M. Delbracio, H. Talebi, G. Gerig, and P. Milanfar, “Multiscale structure guided diffusion for image deblurring,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10721–10733.
  28. [28] S. Patni, A. Agarwal, and C. Arora, “Ecodepth: Effective conditioning of diffusion models for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28285–28295.
  29. [29] M. Lavreniuk, S. F. Bhat, M. Müller, and P. Wonka, “Evp: Enhanced visual perception using inverse multi-attentive feature refinement and regularized image-text alignment,” arXiv preprint arXiv:2312.08548, 2023.
  30. [30] D. P. Kingma, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  31. [31] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 9492–9502.
  32. [32] A. Agarwal and C. Arora, “Attention attention everywhere: Monocular depth prediction with skip attention,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5861–5870.
  33. [33] W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “New crfs: Neural window fully-connected crfs for monocular depth estimation,” arXiv preprint arXiv:2203.01502, 2022.
  34. [34] X. Yang, Z. Ma, Z. Ji, and Z. Ren, “Gedepth: Ground embedding for monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12719–12727.
  35. [35] T. Wang, Z. Xinge, J. Pang, and D. Lin, “Probabilistic and geometric depth: Detecting objects in perspective,” in Conference on Robot Learning. PMLR, 2022, pp. 1475–1485.
  36. [36] C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, “Va-depthnet: A variational approach to single image depth prediction,” arXiv preprint arXiv:2302.06556, 2023.
  37. [37] X. Song, H. Hu, L. Liang, W. Shi, G. Xie, X. Lu, and X. Hei, “Unsupervised monocular estimation of depth and visual odometry using attention and depth-pose consistency loss,” IEEE Transactions on Multimedia, vol. 26, pp. 3517–3529, 2024.
  38. [38] C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, “Single image depth prediction made better: A multivariate gaussian take,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17346–17356.
  39. [39] R. Li, D. Xue, Y. Zhu, H. Wu, J. Sun, and Y. Zhang, “Self-supervised monocular depth estimation with frequency-based recurrent refinement,” IEEE Transactions on Multimedia, vol. 25, pp. 5626–5637, 2023.
  40. [40] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018.
  41. [41] Z. Li, X. Wang, X. Liu, and J. Jiang, “Binsformer: Revisiting adaptive bins for monocular depth estimation,” IEEE Transactions on Image Processing, 2024.
  42. [42] S. Shao, Z. Pei, X. Wu, Z. Liu, W. Chen, and Z. Li, “Iebins: Iterative elastic bins for monocular depth estimation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  43. [43] S. F. Bhat, I. Alhashim, and P. Wonka, “Localbins: Improving depth estimation by learning local distributions,” in European Conference on Computer Vision. Springer, 2022, pp. 480–496.
  44. [44] H. Jiang, A. Luo, H. Fan, S. Han, and S. Liu, “Low-light image enhancement with wavelet-based diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 6, pp. 1–14, 2023.
  45. [45] S. Wei, H. Geng, J. Chen, C. Deng, C. Wenbo, C. Zhao, X. Fang, L. Guibas, and H. Wang, “D3roma: Disparity diffusion-based depth sensing for material-agnostic robotic manipulation,” in ECCV 2024 Workshop on Wild 3D: 3D Modeling, Reconstruction, and Generation in the Wild, 2024.
  46. [46] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  47. [47] J. Liu, Q. Wang, H. Fan, Y. Wang, Y. Tang, and L. Qu, “Residual denoising diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2773–2783.
  48. [48] Y. Ji, Z. Chen, E. Xie, L. Hong, X. Liu, Z. Liu, T. Lu, Z. Li, and P. Luo, “Ddp: Diffusion model for dense visual prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21741–21752.
  49. [49] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
  50. [50] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:2057420
  51. [51] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  52. [52] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  53. [53] V. Patil, C. Sakaridis, A. Liniger, and L. Van Gool, “P3depth: Monocular depth estimation with a piecewise planarity prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1610–1621.
  54. [54] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
  55. [55] S. Shao, Z. Pei, W. Chen, R. Li, Z. Liu, and Z. Li, “Urcdc-depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation,” IEEE Transactions on Multimedia, vol. 26, pp. 3341–3353, 2024.
  56. [56] S. Shao, Z. Pei, W. Chen, X. Wu, and Z. Li, “Nddepth: Normal-distance assisted monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7931–7940.
  57. [57] Z. Li, Z. Chen, X. Liu, and J. Jiang, “Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation,” Machine Intelligence Research, vol. 20, no. 6, pp. 837–854, 2023.
  58. [58] Z. Zeng, D. Wang, F. Yang, H. Park, S. Soatto, D. Lao, and A. Wong, “Wordepth: Variational language prior for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9708–9719.
  59. [59] Y. Duan, X. Guo, and Z. Zhu, “Diffusiondepth: Diffusion denoising approach for monocular depth estimation,” in European Conference on Computer Vision. Springer, 2024, pp. 432–449.
  60. [60] L. Piccinelli, C. Sakaridis, and F. Yu, “idisc: Internal discretization for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21477–21487.
  61. [61] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  62. [62] R. Garg, V. K. Bg, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII 14. Springer, 2016, pp. 740–756.