Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
Treating encoder features as degraded and restoring them via invertible diffusion produces more accurate monocular depth estimates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By viewing pretrained encoder features in monocular depth estimation as degraded inputs and applying an Invertible Transform-enhanced Indirect Diffusion module to restore them, the method achieves higher accuracy in predicted depth maps. The bi-Lipschitz condition on the decoder keeps the diffusion iterations stable even though the features themselves receive no direct supervision. Adding the Auxiliary Viewpoint-based Low-level Feature Enhancement module further refines local details. The result is superior performance over prior methods on standard benchmarks such as KITTI.
What carries the argument
The InvT-IndDiffusion module that performs diffusion-based restoration of encoder features under indirect supervision from the depth map, stabilized by an invertible transform satisfying the bi-Lipschitz condition.
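This summary does not reproduce the decoder's construction; as an illustration of how invertibility supports step-to-step stability, a NICE-style additive coupling layer is invertible by construction, and keeping the coupling network's weights small keeps the whole map bi-Lipschitz with constants near 1. A minimal NumPy sketch (all names hypothetical, not the paper's architecture):

```python
import numpy as np

def coupling_forward(x, shift_fn):
    # Additive coupling: split channels, shift one half conditioned on the other.
    x1, x2 = np.split(x, 2, axis=-1)
    y2 = x2 + shift_fn(x1)              # invertible by construction
    return np.concatenate([x1, y2], axis=-1)

def coupling_inverse(y, shift_fn):
    # Exact inverse: the conditioning half y1 passes through unchanged.
    y1, y2 = np.split(y, 2, axis=-1)
    x2 = y2 - shift_fn(y1)
    return np.concatenate([y1, x2], axis=-1)

# Toy shift network with small weights, so the layer's forward and inverse
# Lipschitz constants stay close to 1 (the bi-Lipschitz regime).
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))
shift = lambda h: np.tanh(h @ W)

x = rng.standard_normal((3, 8))
y = coupling_forward(x, shift)
x_rec = coupling_inverse(y, shift)
print(np.allclose(x, x_rec))  # True: no information is lost across the transform
```

Because the map is exactly invertible, an iterate cannot collapse distinct feature states, which is the intuition for using such a decoder to limit drift under purely indirect supervision.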
If this is right
- The proposed method outperforms state-of-the-art approaches on various datasets.
- On the KITTI benchmark the RMSE improves by 4.09 percent and 37.77 percent under different training settings compared to the baseline.
- The AV-LFE module enhances local details in a plug-and-play manner when an auxiliary viewpoint is available.
- Substantial potential exists in current encoder-decoder architectures once encoder features are properly restored.
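The RMSE figures above are relative reductions versus the baseline; a one-line arithmetic check (illustrative baseline value, not from the paper):

```python
def rmse_improvement(baseline_rmse: float, method_rmse: float) -> float:
    """Relative RMSE reduction in percent."""
    return 100.0 * (baseline_rmse - method_rmse) / baseline_rmse

# Hypothetical numbers: a 4.09% improvement means the method's RMSE
# equals baseline * (1 - 0.0409).
baseline = 2.200
print(round(rmse_improvement(baseline, baseline * (1 - 0.0409)), 2))  # 4.09
```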
Where Pith is reading between the lines
- Similar restoration techniques could be tested on related tasks such as surface normal estimation or optical flow to check transferability.
- The stability provided by the invertible transform suggests that diffusion processes in vision tasks benefit from structure-preserving decoders.
- If the indirect supervision works well here, it may allow training depth models with less direct supervision in other settings.
Load-bearing premise
That the features from a pretrained encoder are degraded versions of some ground-truth feature whose corresponding depth matches the actual ground truth, allowing the diffusion to be guided effectively by the final depth output alone.
What would settle it
Running the baseline encoder-decoder without the InvT-IndDiffusion module and observing whether the RMSE on KITTI remains the same or worsens compared to the full method.
Original abstract
Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets. Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at https://github.com/whitehb1/IID-RDepth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reframes monocular depth estimation as restoring degraded pretrained encoder features to an assumed ground-truth feature via an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module that uses only indirect supervision from the final sparse depth map, augmented by a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement (AV-LFE) module; it reports state-of-the-art results with RMSE gains of 4.09% and 37.77% on KITTI under different training settings.
Significance. If the reported gains are shown to arise specifically from controlled feature restoration rather than training differences or the auxiliary module, the work would meaningfully advance MDE by demonstrating that diffusion-based restoration can improve encoder features in standard architectures and by supplying a practical enhancement module.
major comments (2)
- [Method (InvT-IndDiffusion module)] Method section on InvT-IndDiffusion: the claim that the invertible transform decoder under the bi-Lipschitz condition reliably prevents step-to-step feature deviations is unsupported by either a proof that the learned decoder satisfies the condition or any empirical measurement of feature trajectory stability; given the many-to-one feature-to-depth mapping and purely indirect supervision from the sparse depth loss, gradients may steer restored features toward any depth-reducing configuration rather than the intended ground-truth restoration.
- [Experiments] Experiments: the abstract reports 4.09% and 37.77% RMSE improvements on KITTI versus the baseline, yet no ablation studies isolate the contribution of InvT-IndDiffusion from AV-LFE, no error bars or multi-run statistics are supplied, and baseline details are insufficient to rule out unstated training differences as the source of the gains.
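The trajectory-stability evidence the first comment asks for could be as simple as tracking the L2 distance between consecutive feature iterates across diffusion steps; a sketch on synthetic data (hypothetical, not the paper's protocol):

```python
import numpy as np

def step_drift(features):
    """L2 distance between consecutive iterates in a feature trajectory.

    A stable restoration process should show bounded (ideally shrinking)
    drift; exploding values would indicate the instability the referee
    suspects.
    """
    return [float(np.linalg.norm(b - a)) for a, b in zip(features, features[1:])]

# Synthetic trajectory contracting toward a fixed point.
rng = np.random.default_rng(1)
target = rng.standard_normal((16, 32))
feats = [target + (0.5 ** t) * rng.standard_normal((16, 32)) for t in range(5)]
drifts = step_drift(feats)
print(len(drifts), all(d > 0 for d in drifts))
```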
minor comments (2)
- [Abstract] Abstract: the statement that the method achieves 'better performance than the state-of-the-art methods on various datasets' should name the specific competing methods and list all metrics used, not only RMSE.
- [Method] Notation: the diffusion noise schedule, step count, and invertible transform parameters are listed as free parameters but receive no explicit discussion of how they are chosen or ablated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional analysis and experiments as outlined.
Point-by-point responses
- Referee: [Method (InvT-IndDiffusion module)] Method section on InvT-IndDiffusion: the claim that the invertible transform decoder under the bi-Lipschitz condition reliably prevents step-to-step feature deviations is unsupported by either a proof that the learned decoder satisfies the condition or any empirical measurement of feature trajectory stability; given the many-to-one feature-to-depth mapping and purely indirect supervision from the sparse depth loss, gradients may steer restored features toward any depth-reducing configuration rather than the intended ground-truth restoration.
Authors: We appreciate the referee highlighting this gap. The current manuscript does not contain a formal proof that the learned decoder satisfies the bi-Lipschitz condition, nor does it report empirical measurements of feature trajectory stability across diffusion steps. The InvT-IndDiffusion module employs an invertible transform decoder whose design draws on the information-preserving properties of bi-Lipschitz mappings commonly used in flow-based models to constrain deviations during the iterative restoration process. To address the concern directly, we will add a new subsection with empirical measurements (e.g., L2 distances between consecutive restored features) comparing trajectories with and without the invertible decoder. Regarding the many-to-one mapping and indirect supervision, the diffusion is initialized from the pretrained encoder features and guided iteratively by the depth loss; the invertible decoder enforces reversibility, which we argue limits arbitrary drift toward depth-reducing configurations. We will expand the method description to clarify this inductive bias and acknowledge the limitations of purely indirect supervision. revision: yes
- Referee: [Experiments] Experiments: the abstract reports 4.09% and 37.77% RMSE improvements on KITTI versus the baseline, yet no ablation studies isolate the contribution of InvT-IndDiffusion from AV-LFE, no error bars or multi-run statistics are supplied, and baseline details are insufficient to rule out unstated training differences as the source of the gains.
Authors: We agree that the experimental validation needs to be strengthened to isolate component contributions and demonstrate statistical reliability. In the revised version we will add ablation tables that evaluate the full model, the model without InvT-IndDiffusion, the model without AV-LFE, and the baseline, all under identical training protocols. We will also perform at least three independent training runs for the main results and report mean RMSE with standard deviation (error bars) on KITTI. Finally, we will expand the experimental setup section with complete baseline implementation details, including exact hyper-parameters, optimizer settings, data augmentation, and training schedules, to enable direct reproduction and exclude unstated training differences as the source of the reported gains. revision: yes
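The multi-run reporting promised here reduces to a mean and sample standard deviation over independent seeds; a minimal helper with hypothetical RMSE values (not the paper's results):

```python
import statistics

def summarize_runs(rmses):
    """Mean and sample standard deviation over independent training runs."""
    return statistics.mean(rmses), statistics.stdev(rmses)

# Three hypothetical KITTI RMSEs from different random seeds.
mean, std = summarize_runs([2.11, 2.13, 2.09])
print(f"{mean:.3f} ± {std:.3f}")  # 2.110 ± 0.020
```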
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper formulates MDE as feature restoration by assumption, introduces InvT-IndDiffusion and AV-LFE modules, and trains them end-to-end using indirect depth-map supervision. Reported gains (4.09% and 37.77% RMSE on KITTI) are empirical benchmark results, not identities obtained by substituting fitted parameters or self-citations back into the output. No equation reduces the final depth to a hyperparameter by construction, no load-bearing self-citation chain exists, and the bi-Lipschitz claim is an architectural assertion rather than a tautology. The method therefore remains externally falsifiable on standard datasets.
Axiom & Free-Parameter Ledger
free parameters (2)
- Diffusion noise schedule and step count
- Invertible transform parameters
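The noise schedule and step count are the usual diffusion free parameters; for concreteness, a generic DDPM-style linear variance schedule and its cumulative signal coefficient (a common default, not necessarily the paper's choice):

```python
import numpy as np

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T as used in standard DDPMs."""
    return np.linspace(beta_start, beta_end, steps)

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal retained after t steps
print(bool(alpha_bar[0] > alpha_bar[-1]))  # True: signal decays monotonically
```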
axioms (2)
- Domain assumption: pretrained encoder features are degraded versions of an ideal ground-truth feature that would produce the ground-truth depth map
- Ad hoc to paper: indirect supervision from the final depth map is adequate to train the diffusion restoration despite the absence of direct feature-level supervision
invented entities (2)
- InvT-IndDiffusion module (no independent evidence)
- AV-LFE module (no independent evidence)