pith. machine review for the scientific record.

arxiv: 2604.06576 · v1 · submitted 2026-04-08 · 💻 cs.CV · eess.IV

Recognition: 2 theorem links


LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification: 💻 cs.CV · eess.IV
keywords: monocular depth estimation · subspace representation · frame theory · lifting theory · depth bins · edge enhancement · transformer model

The pith

LiftFormer maps image features into depth-binned geometric subspaces to turn monocular depth estimation into direct representation matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LiftFormer, which reformulates monocular depth estimation by lifting spatial image features into an intermediate depth-oriented geometric representation subspace built from frame theory. In this subspace the transformed features align directly with depth bin values, bridging color-based inputs to geometric depth outputs. A second edge-aware representation subspace refines predictions where depth changes sharply. Experiments show state-of-the-art results on standard datasets, with ablation confirming that both lifting steps contribute to the gains.

Core claim

A depth-oriented geometric representation (DGR) subspace is built from linearly dependent vectors tied to depth bins to give a redundant representation; image features are mapped into it so that they correspond directly to depth values. An edge-aware representation (ER) subspace is constructed in parallel so that depth features can be used to boost local accuracy around edges.

What carries the argument

Two lifting modules: one embeds image features into the DGR subspace (frame-theory depth bins), the other into the ER subspace (edge sharpening), together converting color-to-depth learning into subspace representation matching.
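To make the bin-matching pattern concrete, here is a minimal, hypothetical sketch of lifting features onto a redundant set of per-bin vectors; it is not the paper's SF-DGR module, and the function name, shapes, and the 0–80 m depth range are illustrative assumptions. Pixel features are projected onto one embedding per depth bin, the coefficients are read as a distribution over bins, and the predicted depth is the expectation over bin centers.

```python
import torch
import torch.nn.functional as F

def lift_to_depth_bins(feats, bin_vectors, bin_centers):
    """Sketch of depth-binned representation matching (illustrative, not the paper's module).

    feats:       (B, C, H, W) image features from an encoder
    bin_vectors: (K, C) one embedding per depth bin; K > C makes the set
                 linearly dependent, i.e. a redundant "frame"
    bin_centers: (K,) metric depth value associated with each bin
    """
    # "Lift": express every pixel feature by its inner products with the bin vectors.
    coeffs = torch.einsum("bchw,kc->bkhw", feats, bin_vectors)   # (B, K, H, W)
    # Read the lifted coefficients as a distribution over depth bins.
    probs = F.softmax(coeffs, dim=1)
    # Direct correspondence to depth: expectation over the bin centers.
    depth = torch.einsum("bkhw,k->bhw", probs, bin_centers)      # (B, H, W)
    return depth, coeffs

# Toy usage with made-up shapes and a KITTI-like 0-80 m range.
feats = torch.randn(2, 64, 24, 80)
bin_vectors = torch.randn(256, 64)             # 256 bins in a 64-dim space: redundant
bin_centers = torch.linspace(1e-3, 80.0, 256)
depth, _ = lift_to_depth_bins(feats, bin_vectors, bin_centers)
print(depth.shape)  # torch.Size([2, 24, 80])
```

Under this reading, "features correspond directly to depth values" means that the lifted coefficients, rather than the raw color features, carry the depth semantics.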

If this is right

  • Depth prediction reduces to learning and matching representations inside a redundant, depth-binned space.
  • Edge artifacts drop because the ER subspace supplies targeted local enhancement.
  • The redundant frame-theory basis increases robustness to small variations in input features (a generic statement of this property follows this list).
  • Strong performance on standard monocular depth benchmarks follows without extra post-processing.
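On the robustness bullet, the generic fact from the frame-theory background the paper cites (references [59], [60]) can be stated compactly; the sketch below is the standard finite-frame statement, not the paper's specific DGR construction.

```latex
% A family {f_k}_{k=1}^{K} in R^C with K > C (hence linearly dependent) is a frame if
\[
  A\,\lVert x \rVert^{2} \;\le\; \sum_{k=1}^{K} \bigl|\langle x, f_k \rangle\bigr|^{2}
  \;\le\; B\,\lVert x \rVert^{2}
  \qquad \text{for all } x \in \mathbb{R}^{C}, \quad 0 < A \le B < \infty .
\]
% Every x is then recoverable from its frame coefficients through the invertible
% frame operator S; because K > C, the information is spread over many
% coefficients, so perturbing any single inner product changes x only mildly.
\[
  x \;=\; \sum_{k=1}^{K} \langle x, f_k \rangle \, S^{-1} f_k ,
  \qquad S \;=\; \sum_{k=1}^{K} f_k f_k^{\top} .
\]
```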

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting pattern could be tested on related dense tasks such as surface normal estimation or instance depth.
  • Replacing the backbone with other modern vision transformers would isolate whether the subspace modules alone drive the reported gains.
  • Theoretical bounds on representation stability could be derived from the frame-theory construction to predict failure modes on new scene types.

Load-bearing premise

That features placed in the DGR subspace will map to correct depth values without losing global consistency or creating new boundary errors.

What would settle it

An ablation in which removing the DGR or ER lifting module leaves accuracy unchanged, or improves it, on the same test sets and metrics.

Figures

Figures reproduced from arXiv: 2604.06576 by Chong Lv, Chuankun Li, Huibin Bai, Hui Yuan, Shuai Li, Tian Xie, Wei Hua, Yanbo Gao.

Figure 1: Illustration of the feature flow in our LiftFormer. Existing MDE …
Figure 2: Overview of the proposed LiftFormer architecture. The image spatial features are lifted to the depth-oriented geometric representation (DGR) subspace …
Figure 3: Illustration of the SF-DGR subspace transformation-based lifting.
Figure 4: t-SNE visualization of the encoder and DGR features in our LiftFormer versus the decoder features obtained by PixelFormer [23] at two scales …
Figure 5: Illustration of the DF-ER subspace transformation-based lifting.
Figure 6: Visualization of the ER coefficients obtained on the KITTI dataset.
Figure 7: Qualitative results of the proposed LiftFormer in comparison with those of PixelFormer [23] on the KITTI dataset.
Figure 8: Qualitative results of the proposed LiftFormer in comparison with those of PixelFormer [23] on the NYUV2 dataset.
Figure 9: Error map visualization on the KITTI dataset. The first column shows the error map of PixelFormer [23]. The second column shows the error map …
read the original abstract

Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LiftFormer for monocular depth estimation by reformulating depth prediction as a mapping of image spatial features into a depth-oriented geometric representation (DGR) subspace constructed via frame theory from linearly dependent vectors aligned with depth bins, thereby bridging color features to geometric depth values. An additional edge-aware representation (ER) subspace is introduced to transform depth features and enhance local predictions around edges. The paper claims this yields state-of-the-art performance on standard datasets, with ablations confirming the effectiveness of the two lifting modules.

Significance. If the asserted direct correspondence property of the DGR subspace can be rigorously derived from frame theory and the empirical gains prove robust and reproducible, the work could introduce a principled geometric embedding technique that improves interpretability and edge accuracy in depth estimation networks beyond standard encoder-decoder designs.

major comments (2)
  1. [Abstract] The central claim that 'the image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values' is asserted without derivation. No equations or frame-theoretic properties (e.g., inner-product monotonicity with depth bins, invertibility of the redundant representation, or guarantees against artifacts) are supplied to show why the chosen linearly dependent frame vectors produce this direct correspondence rather than a generic embedding. This mapping is load-bearing for the lifting modules and for interpreting the ablation results as validation of the approach.
  2. [Abstract] The statements that 'our LiftFormer achieves state-of-the-art performance on widely used datasets' and that 'an ablation study validates the effectiveness of both proposed lifting modules' are presented without any quantitative metrics, error bars, dataset names, baseline comparisons, or statistical tests. The full manuscript must supply these (including tables of Abs Rel, RMSE, etc.) to allow assessment of whether the subspace constructions actually drive the reported gains.
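As context for the second major comment, the standard monocular-depth metrics it refers to are defined below; these are the field's conventional definitions, not values or notation taken from the manuscript.

```latex
% Standard evaluation metrics over N valid pixels, with prediction d_i and ground truth d_i^*.
\[
  \mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert d_i - d_i^{*}\rvert}{d_i^{*}},
  \qquad
  \mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{(d_i - d_i^{*})^{2}}{d_i^{*}},
  \qquad
  \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (d_i - d_i^{*})^{2}},
\]
\[
  \delta_t = \frac{1}{N}\,\Bigl\lvert\Bigl\{\, i : \max\!\Bigl(\tfrac{d_i}{d_i^{*}},\, \tfrac{d_i^{*}}{d_i}\Bigr) < 1.25^{\,t} \Bigr\}\Bigr\rvert,
  \qquad t \in \{1, 2, 3\}.
\]
```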
minor comments (2)
  1. [Abstract] The abstract refers to 'lifting theory topology' without a specific reference or brief explanation of the lifting scheme employed; a citation or short definition would improve accessibility.
  2. The free parameters (number of depth bins, subspace dimensionality) are introduced without discussion of sensitivity or selection criteria; a brief analysis or default values would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve theoretical rigor and clarity.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values' is asserted without derivation. No equations or frame-theoretic properties (e.g., inner-product monotonicity with depth bins, invertibility of the redundant representation, or guarantees against artifacts) are supplied to show why the chosen linearly dependent frame vectors produce this direct correspondence rather than a generic embedding. This mapping is load-bearing for the lifting modules and for interpreting the ablation results as validation of the approach.

    Authors: We acknowledge that the abstract presents the direct correspondence claim without an explicit derivation. The manuscript's Section 3.2 describes the DGR subspace construction via frame theory using linearly dependent vectors aligned with depth bins, but we agree a self-contained derivation of key properties (inner-product monotonicity, invertibility of the redundant frame, and artifact bounds) is not provided. In the revised version we will insert a dedicated subsection with the required equations and frame-theoretic arguments to establish why this yields direct depth correspondence rather than a generic embedding. revision: yes

  2. Referee: [Abstract] The statements that 'our LiftFormer achieves state-of-the-art performance on widely used datasets' and that 'an ablation study validates the effectiveness of both proposed lifting modules' are presented without any quantitative metrics, error bars, dataset names, baseline comparisons, or statistical tests. The full manuscript must supply these (including tables of Abs Rel, RMSE, etc.) to allow assessment of whether the subspace constructions actually drive the reported gains.

    Authors: The full manuscript already contains Table 1 with Abs Rel, Sq Rel, RMSE, and other metrics on KITTI and NYU Depth V2 together with baseline comparisons, and Table 2 with the ablation results for the two lifting modules. Error bars are omitted following common practice in the field, but we will add a reproducibility note. To directly address the abstract, we will insert concise quantitative highlights (key metrics, datasets, and main baselines) into the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; formulation is a modeling choice with external validation

full rationale

The paper formulates MDE as a transformation into DGR and ER subspaces constructed via frame theory with vectors aligned to depth bins, asserting that features 'correspond directly to the depth values' within this representation. This is presented as the core architectural proposal rather than a derived equality that reduces to inputs by construction. No equations are supplied that equate the claimed correspondence to a fitted parameter or prior self-result. SOTA performance and ablation results are reported on standard external datasets (KITTI and NYU Depth V2), providing independent empirical grounding. No load-bearing self-citations, uniqueness theorems, or renamings of known results are evident in the provided text that would force the central claims. The derivation chain remains self-contained as a proposed lifting-based architecture.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the unproven applicability of lifting theory topology to create a bridging subspace between color features and depth values, plus the assumption that frame theory with per-bin vectors yields a robust redundant representation; no independent evidence for these mappings is supplied in the abstract.
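As a one-line illustration of the redundancy assumption (with hypothetical numbers, since neither free parameter is quantified in the abstract): whenever the number of depth bins exceeds the feature dimensionality, the per-bin vectors are linearly dependent by construction.

```python
import numpy as np

# Hypothetical values: C feature dimensions, K depth bins, with K > C.
C, K = 64, 256
bin_vectors = np.random.randn(K, C)
# The rank is at most C, so at least K - C = 192 bin vectors are linear
# combinations of the others: the representation is redundant by construction.
print(np.linalg.matrix_rank(bin_vectors))  # prints 64 (almost surely)
```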

free parameters (2)
  • number of depth bins
    Used to define the linearly dependent vectors that construct the DGR subspace
  • subspace dimensionality
    Determines the size of the redundant representation but not quantified in the abstract
axioms (2)
  • domain assumption: Lifting theory topology can construct an intermediate subspace that directly bridges image color features and geometric depth values
    Invoked as the basis for the DGR construction in the abstract
  • domain assumption: Frame theory with linearly dependent vectors per depth bin provides a redundant and robust representation for depth prediction
    Stated as the mechanism that allows features to correspond directly to depth values
invented entities (2)
  • Depth-oriented geometric representation (DGR) subspace (no independent evidence)
    purpose: To transform image features into a space where they correspond directly to depth values
    Newly constructed using frame theory and depth bins
  • Edge-aware representation (ER) subspace (no independent evidence)
    purpose: To enhance depth features around edges where sharp changes occur
    Newly constructed to address common prediction errors at boundaries

pith-pipeline@v0.9.0 · 5578 in / 1730 out tokens · 79928 ms · 2026-05-10T19:05:25.679224+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 5 canonical work pages · 2 internal anchors

  [1] C. Ling, X. Zhang, and H. Chen, "Unsupervised monocular depth estimation using attention and multi-warp reconstruction," IEEE Transactions on Multimedia, vol. 24, pp. 2938–2949, 2021.
  [2] X. Yang, Y. Gao, H. Luo, C. Liao, and K.-T. Cheng, "Bayesian denet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty," IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2701–2713, 2019.
  [3] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3828–3838.
  [4] J. Wu, R. Ji, Q. Wang, S. Zhang, X. Sun, Y. Wang, M. Xu, and F. Huang, "Fast monocular depth estimation via side prediction aggregation with continuous spatial refinement," IEEE Transactions on Multimedia, vol. 25, pp. 1204–1216, 2023.
  [5] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12179–12188.
  [6] X. Wang, W. Kong, Q. Zhang, Y. Yang, T. Zhao, and J. Jiang, "Distortion-aware self-supervised indoor 360° depth estimation via hybrid projection fusion and structural regularities," IEEE Transactions on Multimedia, vol. 26, pp. 3998–4011, 2024.
  [7] S. Shao, R. Li, Z. Pei, Z. Liu, W. Chen, W. Zhu, X. Wu, and B. Zhang, "Towards comprehensive monocular depth estimation: Multiple heads are better than one," IEEE Transactions on Multimedia, vol. 25, pp. 7660–7671, 2023.
  [8] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, "3D packing for self-supervised monocular depth estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  [9] J. Lei, B. Peng, C. Zhang, X. Mei, X. Cao, X. Fan, and X. Li, "Shape-preserving object depth control for stereoscopic images," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 12, pp. 3333–3344, 2018.
  [10] S. Guo, J. Hu, K. Zhou, J. Wang, L. Song, R. Xie, and W. Zhang, "Real-time free viewpoint video synthesis system based on dibr and a depth estimation network," IEEE Transactions on Multimedia, pp. 1–16, 2024.
  [11] J. Lei, T. Guo, B. Peng, and C. Yu, "Depth-assisted joint detection network for monocular 3d object detection," in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 2204–2208.
  [12] D. Wang, Y. Xu, H. Zhu, and K. Liu, "A novel framework for pothole area estimation based on object detection and monocular metric depth estimation," in 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2024, pp. 1–6.
  [13] S. Ullman, "The interpretation of structure from motion," Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 203, no. 1153, pp. 405–426, 1979.
  [14] T. Nagai, T. Naruse, M. Ikehara, and A. Kurematsu, "HMM-based surface reconstruction from single images," in Proceedings. International Conference on Image Processing, vol. 2. IEEE, 2002, pp. II–II.
  [15] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, vol. 27, 2014.
  [16] W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, "Neural window fully-connected CRFs for monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3916–3925.
  [17] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, "Bevdepth: Acquisition of reliable depth for multi-view 3d object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1477–1485.
  [18] C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, "Single image depth prediction made better: A multivariate gaussian take," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17346–17356.
  [19] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, "Panoformer: Panorama transformer for indoor 360° depth estimation," in European Conference on Computer Vision, 2022.
  [20] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
  [21] S. F. Bhat, I. Alhashim, and P. Wonka, "Adabins: Depth estimation using adaptive bins," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018.
  [22] Z. Li, X. Wang, X. Liu, and J. Jiang, "Binsformer: Revisiting adaptive bins for monocular depth estimation," arXiv preprint arXiv:2204.00987, 2022.
  [23] A. Agarwal and C. Arora, "Attention attention everywhere: Monocular depth prediction with skip attention," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5861–5870.
  [24] Z. Li, Z. Chen, X. Liu, and J. Jiang, "Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation," Machine Intelligence Research, pp. 1–18, 2023.
  [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
  [26] S. d'Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, "Convit: Improving vision transformers with soft convolutional inductive biases," in International Conference on Machine Learning. PMLR, 2021, pp. 2286–2296.
  [27] O. N. Manzari, H. Kashiani, H. A. Dehkordi, and S. B. Shokouhi, "Robust transformer with locality inductive bias and feature normalization," Engineering Science and Technology, an International Journal, vol. 38, p. 101320, 2023.
  [28] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, "Cvt: Introducing convolutions to vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22–31.
  [29] J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, "Cmt: Convolutional neural networks meet vision transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12175–12185.
  [30] X. Yang, S. Zhang, and B. Zhao, "Self-supervised monocular depth estimation with multi-constraints," in 2021 40th Chinese Control Conference (CCC), 2021, pp. 8422–8427.
  [31] X. Chen, X. Chen, and Z.-J. Zha, "Structure-aware residual pyramid network for monocular depth estimation," in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19). AAAI Press, 2019, pp. 694–700.
  [32] J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, "From big to small: Multi-scale local planar guidance for monocular depth estimation," arXiv preprint arXiv:1907.10326, 2019.
  [33] S. Shao, Z. Pei, W. Chen, R. Li, Z. Liu, and Z. Li, "Urcdc-depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation," IEEE Transactions on Multimedia, vol. 26, pp. 3341–3353, 2024.
  [34] C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, "Va-depthnet: A variational approach to single image depth prediction," in International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
  [35] X. Song, H. Hu, L. Liang, W. Shi, G. Xie, X. Lu, and X. Hei, "Unsupervised monocular estimation of depth and visual odometry using attention and depth-pose consistency loss," IEEE Transactions on Multimedia, vol. 26, pp. 3517–3529, 2024.
  [36] L. Piccinelli, C. Sakaridis, and F. Yu, "idisc: Internal discretization for monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21477–21487.
  [37] R. Li, D. Xue, Y. Zhu, H. Wu, J. Sun, and Y. Zhang, "Self-supervised monocular depth estimation with frequency-based recurrent refinement," IEEE Transactions on Multimedia, vol. 25, pp. 5626–5637, 2023.
  [38] X. Chen, X. Chen, Y. Zhang, X. Fu, and Z.-J. Zha, "Laplacian pyramid neural network for dense continuous-value regression for complex scenes," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 11, pp. 5034–5046, 2021.
  [39] G. Xiong, J. Qi, Y. Peng, Y. Ping, and C. Wu, "Rddepth: A lightweight algorithm for monocular depth estimation," in 2024 4th International Conference on Computer, Control and Robotics (ICCCR), 2024, pp. 26–30.
  [40] F. Liu, C. Shen, and G. Lin, "Deep convolutional neural fields for depth estimation from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5162–5170.
  [41] D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, "Fastdepth: Fast monocular depth estimation on embedded systems," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108.
  [42] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, "Learning depth from monocular videos using direct methods," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2022–2030.
  [43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (ICLR), Austria, May 3–7, 2021.
  [44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  [46] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, "Simmim: A simple framework for masked image modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9653–9663.
  [47] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, "Geonet: Geometric neural network for joint depth and surface normal estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 283–291.
  [48] Y. Ji, Z. Chen, E. Xie, L. Hong, X. Liu, Z. Liu, T. Lu, Z. Li, and P. Luo, "Ddp: Diffusion model for dense visual prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21741–21752.
  [49] W. Peebles and S. Xie, "Scalable diffusion models with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205.
  [50] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, "Depth anything: Unleashing the power of large-scale unlabeled data," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10371–10381.
  [51] H. Jung, E. Park, and S. Yoo, "Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12642–12652.
  [52] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, "Joint task-recursive learning for semantic segmentation and depth estimation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 235–251.
  [53] S. F. Bhat, I. Alhashim, and P. Wonka, "Localbins: Improving depth estimation by learning local distributions," in European Conference on Computer Vision. Springer, 2022, pp. 480–496.
  [54] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, "Zoedepth: Zero-shot transfer by combining relative and metric depth," arXiv preprint arXiv:2302.12288, 2023.
  [55] A. Saxena, S. Chung, and A. Ng, "Learning depth from single monocular images," Advances in Neural Information Processing Systems, vol. 18, 2005.
  [56] G. C. Gini, A. Marchi et al., "Indoor robot navigation with single camera vision," PRIS, vol. 2, pp. 67–76, 2002.
  [57] M. Shao, T. Simchony, and R. Chellappa, "New algorithms from reconstruction of a 3-d depth map from one or more images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1988, pp. 530–531.
  [58] C.-H. Chen, H. Zhou, and T. Ahonen, "Blur-aware disparity estimation from defocus stereo images," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 855–863.
  [59] J. Kovačević, A. Chebira et al., "An introduction to frames," Foundations and Trends® in Signal Processing, vol. 2, no. 1, pp. 1–94, 2008.
  [60] O. Christensen et al., An introduction to frames and Riesz bases. Springer, 2003, vol. 7.
  [61] M. C. Cieslak, A. M. Castelfranco, V. Roncalli, P. H. Lenz, and D. K. Hartline, "t-distributed stochastic neighbor embedding (t-sne): A tool for eco-physiological transcriptomic analysis," Marine Genomics, vol. 51, p. 100723, 2020.
  [62] Z. Yu, C. Feng, M.-Y. Liu, and S. Ramalingam, "Casenet: Deep category-aware semantic edge detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5964–5973.
  [63] L. Talker, A. Cohen, E. Yosef, A. Dana, and M. Dinerstein, "Mind the edge: Refining depth edges in sparsely-supervised monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 10606–10616.
  [64] M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei, "Rethinking bisenet for real-time semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9716–9725.
  [65] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 391–405.
  [66] T. Naderi, A. Sadovnik, J. Hayward, and H. Qi, "Monocular depth estimation with adaptive geometric attention," in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 617–627.
  [67] G. Manimaran and J. Swaminathan, "Focal-wnet: An architecture unifying convolution and attention for depth estimation," in 2022 IEEE 7th International Conference for Convergence in Technology (I2CT). IEEE, 2022, pp. 1–7.
  [68] V. Patil, C. Sakaridis, A. Liniger, and L. Van Gool, "P3depth: Monocular depth estimation with a piecewise planarity prior," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1610–1621.
  [69] J. Wang, G. Zhang, Z. Wu, X. Li, and L. Liu, "Self-supervised joint learning framework of depth estimation via implicit cues," arXiv preprint arXiv:2006.09876, 2020.
  [70] M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, "Dinov2: Learning robust visual features without supervision."
  [71] [Online]. Available: https://api.semanticscholar.org/CorpusID:258170077
  [72] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
  [73] G. Bae, I. Budvytis, and R. Cipolla, "Irondepth: Iterative refinement of single-view depth using surface normal and its uncertainty," in British Machine Vision Conference (BMVC), 2022.
  [74] C.-Y. Wu, Y. Zhong, J. Wang, and U. Neumann, "Meta-optimization for higher model generalizability in single-image depth prediction," in International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
  [75] S. Zhang, L. Yang, M. B. Mi, X. Zheng, and A. Yao, "Improving deep regression with ordinal entropy," in International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
  [76] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  [77] R. Garg, V. K. Bg, G. Carneiro, and I. Reid, "Unsupervised cnn for single view depth estimation: Geometry to the rescue," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII 14. Springer, 2016, pp. 740–756.
  [78] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from rgbd images," in ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760.
  [79] S. Lee, J. Lee, B. Kim, E. Yi, and J. Kim, "Patch-wise attention network for monocular depth estimation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 1873–1881.