pith. machine review for the scientific record.

arxiv: 2605.11578 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

The Midas Touch for Metric Depth

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 01:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords depth estimation · metric depth · relative depth · depth completion · sparse 3D data · graph optimization · geodesic cost

The pith

MTD turns relative depth estimates into accurate metric depth using only a few 3D points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MTD as a way to add real-world scale to relative depth predictions that lack it. It works by dividing the image into segments, using sparse 3D points to set the right scale for each segment through graph optimization, and then smoothing the result pixel by pixel while respecting edges where depth changes suddenly. This produces better metric depth than prior methods that try to complete or estimate depth directly. The design is simple enough to attach to existing systems for various 3D tasks.
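To make the two-stage idea concrete, here is a minimal sketch of the segment-wise stage under our own assumptions. The paper's actual formulation, solver, and edge weights are not reproduced in this summary; every name below (`rel_depth`, `segments`, `seeds`, `recover_segment_scales`) is ours, not the authors'.

```python
import numpy as np
from collections import defaultdict

def recover_segment_scales(rel_depth, segments, seeds, n_iters=200, lam=1.0):
    """Toy segment-wise scale recovery.

    rel_depth: (H, W) relative (scale-free) depth map.
    segments:  (H, W) integer superpixel labels in 0..n_seg-1.
    seeds:     list of (row, col, metric_z) sparse 3D measurements.

    Returns one multiplicative scale per segment. This is a crude
    stand-in for the paper's sparse graph optimization: a data term
    fits each seeded segment's scale, and a smoothness term pulls
    neighboring segments toward each other.
    """
    n_seg = int(segments.max()) + 1

    # Data term: least-squares scale per segment from its own seeds,
    # minimizing sum (s_i * d_rel - z)^2 over seeds in segment i.
    num = np.zeros(n_seg)
    den = np.zeros(n_seg)
    for r, c, z in seeds:
        i = segments[r, c]
        d = rel_depth[r, c]
        num[i] += d * z
        den[i] += d * d

    # 4-neighbor segment adjacency read off the label map.
    nbrs = defaultdict(set)
    for a, b in ((segments[:, :-1], segments[:, 1:]),
                 (segments[:-1, :], segments[1:, :])):
        m = a != b
        for i, j in zip(a[m], b[m]):
            nbrs[int(i)].add(int(j))
            nbrs[int(j)].add(int(i))

    # Minimize sum_i (den_i * s_i^2 - 2 num_i * s_i)
    #        + lam * sum_{(i,j)} (s_i - s_j)^2
    # by Gauss-Seidel sweeps; segments without seeds get their scale
    # entirely from neighbors, which is where the graph matters.
    s = np.where(den > 0, num / np.maximum(den, 1e-12), 1.0)
    for _ in range(n_iters):
        for i in range(n_seg):
            w = den[i] + lam * len(nbrs[i])
            if w == 0:
                continue  # isolated, unseeded segment keeps scale 1.0
            s[i] = (num[i] + lam * sum(s[j] for j in nbrs[i])) / w
    return s
```

A rough metric map then follows as `s[segments] * rel_depth`; in the paper's pipeline, the pixel-wise geodesic refinement would still smooth the residual seams this per-segment scaling leaves behind.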

Core claim

MTD is a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks.

What carries the argument

Segment-wise scale recovery via sparse graph optimization followed by pixel-wise refinement with a discontinuity-aware geodesic cost.
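The abstract gives no equations, so the following is only a plausible shape for those two stages, written in our notation rather than the paper's: per-segment scales \(s_i\) fit to seed depths and coupled across the segment-adjacency graph, then a per-pixel scale blended from seeds with weights that decay with a discontinuity-aware geodesic distance \(G\).

```latex
% Segment-wise recovery (our guess at the form): seeds (p, z_p) in
% segment i, relative depth d_p, adjacency graph E, edge weights w_ij.
\min_{\{s_i\}} \; \sum_i \sum_{(p,\,z_p)\in\mathcal{S}_i}
    \bigl(s_i\, d_p - z_p\bigr)^2
  \;+\; \lambda \sum_{(i,j)\in E} w_{ij}\,\bigl(s_i - s_j\bigr)^2

% Pixel-wise refinement: per-pixel scale from geodesically nearby
% seeds, where G(p, q) grows sharply across depth discontinuities.
\hat{z}(q) \;=\; d_q \cdot
  \frac{\sum_{p\in\mathcal{S}} e^{-G(p,q)/\sigma}\,\bigl(z_p / d_p\bigr)}
       {\sum_{p\in\mathcal{S}} e^{-G(p,q)/\sigma}}
```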

If this is right

  • Metric depth becomes available from relative estimators with minimal additional input.
  • Local scale inconsistencies are removed, leading to more consistent 3D models.
  • Accuracy surpasses previous depth completion and estimation techniques.
  • The lightweight design allows easy integration into other 3D vision pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support real-time metric scaling in robotics by fusing relative depth with occasional sparse measurements.
  • Testing with varying numbers and placements of sparse points would reveal the minimum data needed for reliable segment scales (a minimal harness for this is sketched after this list).
  • The two-stage process might extend to video by adding temporal links between segments across frames.
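A hypothetical harness for the seed-count and placement question, reusing the toy `recover_segment_scales` above; `gt_depth`, `rel_depth`, and `segments` are assumed to be loaded from some benchmark with metric ground truth, and all names here are ours.

```python
import numpy as np

def sample_seeds(gt_depth, n, rng, clustered=False):
    """Draw n seed pixels from a ground-truth metric depth map.

    clustered=True confines seeds to the top-left quadrant, probing
    how far correct scales propagate from where the data actually is.
    """
    h, w = gt_depth.shape
    hh, ww = (h // 2, w // 2) if clustered else (h, w)
    rows = rng.integers(0, hh, size=n)
    cols = rng.integers(0, ww, size=n)
    return [(int(r), int(c), float(gt_depth[r, c]))
            for r, c in zip(rows, cols)]

def seeded_mae(rel_depth, segments, gt_depth, seeds):
    """Mean absolute depth error after segment-wise scaling only."""
    s = recover_segment_scales(rel_depth, segments, seeds)
    metric = s[segments] * rel_depth
    return float(np.mean(np.abs(metric - gt_depth)))

rng = np.random.default_rng(0)
for n in (10, 50, 200, 1000):
    for clustered in (False, True):
        seeds = sample_seeds(gt_depth, n, rng, clustered=clustered)
        tag = "clustered" if clustered else "uniform"
        print(f"{n:5d} seeds ({tag}): "
              f"MAE {seeded_mae(rel_depth, segments, gt_depth, seeds):.3f}")
```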

Load-bearing premise

Extremely sparse 3D data points are sufficient and accurate enough to recover consistent metric scales across segments without introducing new errors, and the discontinuity-aware geodesic cost correctly handles all local inconsistencies.

What would settle it

Apply MTD to a relative depth map with known ground-truth metric depth, using sparse points clustered in one region, then measure scale errors in distant segments that have no nearby points; large errors there would show that the segment-wise recovery fails to propagate scale.
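Numerically, that test could look like the following sketch (again with our names and the toy routines above, not the paper's code), stratifying error by image distance to the nearest seed.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def error_vs_seed_distance(rel_depth, segments, gt_depth, seeds, n_bins=5):
    """Bin per-pixel depth error by image distance to the nearest seed.

    If segment-wise recovery really propagates scale through the graph,
    error should stay roughly flat with distance; if it fails, error
    should climb steeply in segments far from any seed.
    """
    s = recover_segment_scales(rel_depth, segments, seeds)
    err = np.abs(s[segments] * rel_depth - gt_depth)

    # Euclidean distance (in pixels) to the nearest seed location.
    mask = np.ones_like(gt_depth, dtype=bool)
    for r, c, _ in seeds:
        mask[r, c] = False  # zero distance at seed pixels
    dist = distance_transform_edt(mask)

    edges = np.quantile(dist, np.linspace(0, 1, n_bins + 1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (dist >= lo) & (dist <= hi)
        print(f"dist {lo:6.1f}-{hi:6.1f} px: MAE {err[sel].mean():.3f}")
```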

Figures

Figures reproduced from arXiv: 2605.11578 by Hanli Wang, Haoran Zhang, Hongbo Zhao, Rui Fan, Yi Feng, Yu Ma, Zizhan Guo, Zuyi Xiong.

Figure 1: Application versatility of MTD in metric depth perception. (a) For novel scenes with extremely sparse point clouds, our method achieves precise depth completion and outperforms existing state-of-the-art methods. (b) To eliminate offline LiDAR point cloud aggregation, our method achieves real-time, online predictions on handheld edge devices, thanks to its low inference time and high accuracy. (c) For commo…

Figure 2: MTD takes relative depth, sparse 3D seeds, and a superpixel segment set as inputs and outputs reliable metric depth. (a) In the …

Figure 3: Evaluation results on VOID1500 at input resolu…

Figure 4: Ablation on 3D Seed Sparsity and Segment Scale. (a) …

Figure 5: Inference time on embedded system (NVIDIA Jet…

Figure 6: Evaluation setup and visualization. (a) We use a …

Figure 7: Qualitative results on the KITTI Odometry benchmark.

Figure 8: Qualitative comparisons with SoTA depth completion methods.

Figure 9: Illustration of reducing local scale inconsistencies in our …

Figure 10: RGB-LiDAR misalignment on Argoverse 2.

Figure 11: Qualitative results across different depth foundation models on the ETH3D Indoor dataset.

Figure 12: MAE comparison under different superpixel segmentation algorithms. The subfigures correspond to (a) the Felzenszwalb …

Figure 13: Qualitative results for the rectification of commonly …

Figure 14: Additional qualitative results on the KITTI Odometry …

Figure 15: Qualitative results on our method for the Radar depth …
original abstract

Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at https://mias.group/MTD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MTD (Midas Touch for Depth), a plug-and-play method that converts relative depth estimates into metric depth using only extremely sparse 3D data. It applies a segment-wise recovery strategy via sparse graph optimization to eliminate local scale inconsistencies, followed by pixel-wise refinement using a discontinuity-aware geodesic cost. The authors claim strong cross-scene generalization and substantial accuracy improvements over prior depth completion and depth estimation methods, along with lightweight design for downstream 3D tasks.

Significance. If the central claims hold with rigorous validation, the approach could meaningfully improve the practical utility of relative depth models by enabling metric-scale output with minimal additional 3D input, facilitating integration into robotics, AR, and other applications where full metric depth is required but dense sensors are unavailable.

major comments (2)
  1. Abstract: the claim of 'substantial accuracy improvements' and 'strong generalization' is asserted without any quantitative results, error bars, ablation studies, or details on sparse-point selection and optimization formulation, leaving the central claim without visible supporting evidence.
  2. Method description (segment-wise recovery): the graph optimization for per-segment scale recovery is described as relying on 'extremely sparse' points, yet no analysis is provided of the minimum point density required per segment, behavior when points lie near discontinuities, or robustness to noise in the sparse input; this directly bears on whether the method eliminates inconsistencies or merely substitutes new artifacts.
minor comments (2)
  1. Abstract: the parenthetical expansion of MTD is given as 'Midas Touch for Depth' but the bolding and spacing in the title rendering are inconsistent with standard LaTeX conventions.
  2. Project page link is provided but no supplementary material or code repository is referenced, which would aid reproducibility of the claimed lightweight design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our claims and the analysis of the method's robustness. We address each major comment below and have made revisions to the manuscript where appropriate to improve clarity and provide additional supporting evidence.

point-by-point responses
  1. Referee: Abstract: the claim of 'substantial accuracy improvements' and 'strong generalization' is asserted without any quantitative results, error bars, ablation studies, or details on sparse-point selection and optimization formulation, leaving the central claim without visible supporting evidence.

    Authors: We agree that the abstract, due to its brevity, does not include quantitative details or explicit references to supporting experiments. The full manuscript contains extensive quantitative results, ablation studies, and method details in Sections 3 and 4. To address this directly, we have revised the abstract to include key quantitative highlights (e.g., specific accuracy gains on standard benchmarks) and added cross-references to the experimental validation. We have also expanded the method section with more explicit descriptions of sparse-point selection criteria and the graph optimization formulation, ensuring these are tied to the reported results. revision: yes

  2. Referee: Method description (segment-wise recovery): the graph optimization for per-segment scale recovery is described as relying on 'extremely sparse' points, yet no analysis is provided of the minimum point density required per segment, behavior when points lie near discontinuities, or robustness to noise in the sparse input; this directly bears on whether the method eliminates inconsistencies or merely substitutes new artifacts.

    Authors: We appreciate this observation on the need for robustness analysis. The original manuscript includes experimental results with varying levels of sparsity and qualitative examples near discontinuities. However, we acknowledge that a dedicated sensitivity analysis was not sufficiently detailed. We have added a new subsection to the experiments section that analyzes minimum point density per segment, performance when points are near discontinuities, and robustness under input noise. These results confirm that the segment-wise optimization eliminates local inconsistencies without introducing new artifacts, supported by both quantitative metrics and visual comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies standard optimization and geodesic costs to external inputs

full rationale

The paper presents MTD as a plug-and-play conversion pipeline that takes relative depth plus extremely sparse 3D points as inputs, applies segment-wise sparse graph optimization for scale recovery, then pixel-wise refinement via a discontinuity-aware geodesic cost. These operations are described as standard techniques (graph optimization, geodesic distances) whose outputs are not shown to be mathematically identical to any fitted parameter or quantity defined inside the paper itself. No equations equate a 'prediction' to its own fitting objective by construction, no uniqueness theorem is imported from the authors' prior work, and no self-citation chain is load-bearing for the central claim. The derivation chain therefore remains self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed beyond reliance on standard graph optimization and geodesic distance concepts assumed from prior literature.

pith-pipeline@v0.9.0 · 5458 in / 1070 out tokens · 37553 ms · 2026-05-13T01:33:27.827128+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 3 internal anchors

  1. Radhakrishna Achanta et al. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.

  2. Reiner Birkl et al. MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation. arXiv preprint arXiv:2307.14460, 2023.

  3. Aleksei Bochkovskii et al. Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. arXiv preprint arXiv:2410.02073, 2024.

  4. Yohann Cabon et al. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020.

  5. Holger Caesar et al. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  6. Tianyu Chang et al. Domain Generalized Stereo Matching via Hierarchical Visual Transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9559–9568, 2023.

  7. Xinjing Cheng et al. Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 103–119, 2018.

  8. Angela Dai et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  9. Ainaz Eftekhar et al. Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10786–10796, 2021.

  10. David Eigen et al. Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network. Advances in Neural Information Processing Systems (NeurIPS), 27, 2014.

  11. Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

  12. Yi Feng et al. ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 3004–3012, 2025.

  13. Xiao Fu et al. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 241–258. Springer, 2024.

  14. Yongtao Ge et al. GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models. arXiv preprint arXiv:2406.12671, 2024.

  15. Andreas Geiger et al. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012.

  16. Clement Godard et al. Unsupervised Monocular Depth Estimation With Left-Right Consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  17. Clement Godard et al. Digging Into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

  18. Tongfan Guan et al. Neural Markov Random Field for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5459–5469, 2024.

  19. Vitor Guizilini et al. 3D Packing for Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  20. Jing He et al. Lotus: Diffusion-Based Visual Foundation Model for High-Quality Dense Prediction. arXiv preprint arXiv:2409.18124, 2024.

  21. Mu Hu et al. Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  22. Sergio Izquierdo et al. MVSAnywhere: Zero-Shot Multi-View Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11493–11504, 2025.

  23. Junpeng Jing et al. Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3318–3327, 2023.

  24. HyunJun Jung et al. Is My Depth Ground-Truth Good Enough? HAMMER – Highly Accurate Multi-Modal Dataset for Dense 3D Scene Regression. arXiv preprint arXiv:2205.04565, 2022.

  25. Bingxin Ke et al. Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9492–9502, 2024.

  26. Alexander Kirillov et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.

  27. Tobias Koch et al. Comparison of Monocular Depth Estimation Methods Using Geometrically Relevant Metrics on the IBims-1 Dataset. Computer Vision and Image Understanding, 191:102877, 2020.

  28. Rui Li et al. Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9848–9858, 2024.

  29. Zhengqin Li and Jiansheng Chen. Superpixel Segmentation Using Linear Spectral Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

  30. Yingping Liang et al. Distilling Monocular Foundation Model for Fine-Grained Depth Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22254–22265, 2025.

  31. Haotong Lin et al. Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17070–17080, 2025.

  32. Yuankai Lin et al. Dynamic Spatial Propagation Network for Depth Completion. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 1638–1646, 2022.

  33. Lahav Lipson et al. RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching. In 2021 International Conference on 3D Vision (3DV), pages 218–227, 2021.

  34. Xinyu Liu et al. EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14420–14430, 2023.

  35. Jinsun Park et al. Non-Local Spatial Propagation Network for Depth Completion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 120–136. Springer, 2020.

  36. Jin-Hwi Park et al. Depth Prompting for Sensor-Agnostic Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9859–9869, 2024.

  37. Luigi Piccinelli et al. UniDepth: Universal Monocular Metric Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024.

  38. Luigi Piccinelli et al. UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler. arXiv preprint arXiv:2502.20110, 2025.

  39. René Ranftl et al. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2020.

  40. René Ranftl et al. Vision Transformers for Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021.

  41. Zhibo Rao et al. Masked Representation Learning for Domain Generalized Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5435–5444, 2023.

  42. Mike Roberts et al. Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10912–10922, 2021.

  43. Ashutosh Saxena et al. Make3D: Learning 3D Scene Structure from a Single Still Image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.

  44. Thomas Schops et al. A Multi-View Stereo Benchmark With High-Resolution Images and Multi-Camera Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  45. Nathan Silberman et al. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 746–760. Springer, 2012.

  46. Akash Deep Singh et al. Depth Estimation From Camera Image and mmWave Radar Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9275–9285, 2023.

  47. Shuran Song et al. SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

  48. Ziyang Song et al. DepthMaster: Taming Diffusion Models for Monocular Depth Estimation. arXiv preprint arXiv:2501.02576, 2025.

  49. Jiaming Sun et al. LoFTR: Detector-Free Local Feature Matching With Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8922–8931, 2021.

  50. Jie Tang et al. Bilateral Propagation Network for Depth Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9763–9772, 2024.

  51. Zachary Teed and Jia Deng. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. Advances in Neural Information Processing Systems (NeurIPS), 34:16558–16569, 2021.

  52. Igor Vasiljevic et al. DIODE: A Dense Indoor and Outdoor DEpth Dataset. arXiv preprint arXiv:1908.00463, 2019.

  53. Massimiliano Viola et al. Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5359–5370, 2025.

  54. Haotian Wang et al. G2-MonoDepth: A General Framework of Generalized Depth Inference From Monocular RGB+X Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3753–3771, 2024.

  55. Jianyuan Wang et al. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025.

  56. Wenshan Wang et al. TartanAir: A Dataset to Push the Limits of Visual SLAM. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916, 2020.

  57. Xianqi Wang et al. Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19701–19710, 2024.

  58. Yufei Wang et al. LRRU: Long-short Range Recurrent Updating Networks for Depth Completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9422–9432, 2023.

  59. Felix Wimbauer et al. Behind the Scenes: Density Fields for Single View Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9076–9086, 2023.

  60. Alex Wong et al. Unsupervised Depth Completion from Visual Inertial Odometry. IEEE Robotics and Automation Letters, 5(2):1899–1906, 2020.

  61. Kan Wu et al. TinyViT: Fast Pretraining Distillation for Small Vision Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–85. Springer, 2022.

  62. Yunyang Xiong et al. EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16111–16121, 2024.

  63. Gangwei Xu et al. Iterative Geometry Encoding Volume for Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21919–21928, 2023.

  64. Gangwei Xu et al. IGEV++: Iterative Multi-Range Geometry Encoding Volumes for Stereo Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(8):7108–7122, 2025.

  65. Lihe Yang et al. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10371–10381, 2024.

  66. Lihe Yang et al. Depth Anything V2. Advances in Neural Information Processing Systems (NeurIPS), 37:21875–21911, 2024.

  67. Wei Yin et al. DiverseDepth: Affine-Invariant Depth Prediction Using Diverse Data. arXiv preprint arXiv:2002.00569, 2020.

  68. Wei Yin et al. Learning To Recover 3D Scene Shape From a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 204–213, 2021.

  69. Wei Yin et al. Metric3D: Towards Zero-Shot Metric 3D Prediction from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9043–9053, 2023.

  70. Alex Yu et al. pixelNeRF: Neural Radiance Fields From One or Few Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4578–4587, 2021.

  71. Chi Zhang et al. Hierarchical Normalization for Robust Monocular Depth Estimation. Advances in Neural Information Processing Systems (NeurIPS), 35:14128–14139, 2022.

  72. Chaoning Zhang et al. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289, 2023.

  73. Feihu Zhang et al. Domain-Invariant Stereo Matching Networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 420–439. Springer, 2020.

  74. Mengtan Zhang et al. DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation. IEEE Transactions on Image Processing, 34:4258–4272, 2025.

  75. Ning Zhang et al. Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18537–18546, 2023.

  76. Youmin Zhang et al. CompletionFormer: Depth Completion With Convolutions and Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18527–18536, 2023.

  77. Yongjian Zhang et al. Learning Representations from Foundation Models for Domain Generalized Stereo Matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 146–162. Springer, 2024.

  78. Yiming Zuo and Jia Deng. OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 78–95. Springer, 2024.