pith. machine review for the scientific record.

arxiv: 2604.06576 · v1 · submitted 2026-04-08 · 💻 cs.CV · eess.IV

Recognition: 2 theorem links


LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification: 💻 cs.CV · eess.IV
keywords: monocular depth estimation · subspace representation · frame theory · lifting theory · depth bins · edge enhancement · transformer model

The pith

LiftFormer maps image features into depth-binned geometric subspaces to turn monocular depth estimation into direct representation matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LiftFormer, which reformulates monocular depth estimation by lifting spatial image features into an intermediate depth-oriented geometric representation subspace built from frame theory. In this subspace the transformed features align directly with depth bin values, bridging color-based inputs to geometric depth outputs. A second edge-aware representation subspace refines predictions where depth changes sharply. Experiments show state-of-the-art results on standard datasets, with ablation confirming that both lifting steps contribute to the gains.

Core claim

A depth-oriented geometric representation (DGR) subspace is built from linearly dependent vectors tied to depth bins to give a redundant representation; image features are mapped into it so that they correspond directly to depth values. An edge-aware representation (ER) subspace is constructed in parallel so that depth features can be used to boost local accuracy around edges.

What carries the argument

Two lifting modules: one embeds image features into the DGR subspace (frame-theory depth bins), the other into the ER subspace (edge sharpening), together converting color-to-depth learning into subspace representation matching.
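To make the bin-matching pattern concrete, here is a minimal, hypothetical sketch of lifting features onto a redundant set of per-bin vectors; it is not the paper's SF-DGR module, and the function name, shapes, and the 0–80 m depth range are illustrative assumptions. Pixel features are projected onto one embedding per depth bin, the coefficients are read as a distribution over bins, and the predicted depth is the expectation over bin centers.

```python
import torch
import torch.nn.functional as F

def lift_to_depth_bins(feats, bin_vectors, bin_centers):
    """Sketch of depth-binned representation matching (illustrative, not the paper's module).

    feats:       (B, C, H, W) image features from an encoder
    bin_vectors: (K, C) one embedding per depth bin; K > C makes the set
                 linearly dependent, i.e. a redundant "frame"
    bin_centers: (K,) metric depth value associated with each bin
    """
    # "Lift": express every pixel feature by its inner products with the bin vectors.
    coeffs = torch.einsum("bchw,kc->bkhw", feats, bin_vectors)   # (B, K, H, W)
    # Read the lifted coefficients as a distribution over depth bins.
    probs = F.softmax(coeffs, dim=1)
    # Direct correspondence to depth: expectation over the bin centers.
    depth = torch.einsum("bkhw,k->bhw", probs, bin_centers)      # (B, H, W)
    return depth, coeffs

# Toy usage with made-up shapes and a KITTI-like 0-80 m range.
feats = torch.randn(2, 64, 24, 80)
bin_vectors = torch.randn(256, 64)             # 256 bins in a 64-dim space: redundant
bin_centers = torch.linspace(1e-3, 80.0, 256)
depth, _ = lift_to_depth_bins(feats, bin_vectors, bin_centers)
print(depth.shape)  # torch.Size([2, 24, 80])
```

Under this reading, "features correspond directly to depth values" means that the lifted coefficients, rather than the raw color features, carry the depth semantics.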

If this is right

  • Depth prediction reduces to learning and matching representations inside a redundant, depth-binned space.
  • Edge artifacts drop because the ER subspace supplies targeted local enhancement.
  • The redundant frame-theory basis increases robustness to small variations in input features (a generic statement of this property follows this list).
  • Strong performance on standard monocular depth benchmarks follows without extra post-processing.
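On the robustness bullet, the generic fact from the frame-theory background the paper cites (references [59], [60]) can be stated compactly; the sketch below is the standard finite-frame statement, not the paper's specific DGR construction.

```latex
% A family {f_k}_{k=1}^{K} in R^C with K > C (hence linearly dependent) is a frame if
\[
  A\,\lVert x \rVert^{2} \;\le\; \sum_{k=1}^{K} \bigl|\langle x, f_k \rangle\bigr|^{2}
  \;\le\; B\,\lVert x \rVert^{2}
  \qquad \text{for all } x \in \mathbb{R}^{C}, \quad 0 < A \le B < \infty .
\]
% Every x is then recoverable from its frame coefficients through the invertible
% frame operator S; because K > C, the information is spread over many
% coefficients, so perturbing any single inner product changes x only mildly.
\[
  x \;=\; \sum_{k=1}^{K} \langle x, f_k \rangle \, S^{-1} f_k ,
  \qquad S \;=\; \sum_{k=1}^{K} f_k f_k^{\top} .
\]
```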

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting pattern could be tested on related dense tasks such as surface normal estimation or instance depth.
  • Replacing the backbone with other modern vision transformers would isolate whether the subspace modules alone drive the reported gains.
  • Theoretical bounds on representation stability could be derived from the frame-theory construction to predict failure modes on new scene types.

Load-bearing premise

That features placed in the DGR subspace will map to correct depth values without losing global consistency or creating new boundary errors.

What would settle it

An ablation in which removing the DGR or ER lifting module leaves accuracy unchanged, or improves it, on the same test sets and metrics.

Figures

Figures reproduced from arXiv: 2604.06576 by Chong Lv, Chuankun Li, Huibin Bai, Hui Yuan, Shuai Li, Tian Xie, Wei Hua, Yanbo Gao.

Figure 1: Illustration of the feature flow in our LiftFormer. Existing MDE …
Figure 2: Overview of the proposed LiftFormer architecture. The image spatial features are lifted to the depth-oriented geometric representation (DGR) subspace …
Figure 3: Illustration of the SF-DGR subspace transformation-based lifting.
Figure 4: t-SNE visualization of the encoder and DGR features in our LiftFormer versus the decoder features obtained by PixelFormer [23] at two scales …
Figure 5: Illustration of the DF-ER subspace transformation-based lifting.
Figure 6: Visualization of the ER coefficients obtained on the KITTI dataset.
Figure 7: Qualitative results of the proposed LiftFormer in comparison with those of PixelFormer [23] on the KITTI dataset.
Figure 8: Qualitative results of the proposed LiftFormer in comparison with those of PixelFormer [23] on the NYUV2 dataset.
Figure 9: Error map visualization on the KITTI dataset. The first column shows the error map of PixelFormer [23]. The second column shows the error map …
read the original abstract

Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LiftFormer for monocular depth estimation by reformulating depth prediction as a mapping of image spatial features into a depth-oriented geometric representation (DGR) subspace constructed via frame theory from linearly dependent vectors aligned with depth bins, thereby bridging color features to geometric depth values. An additional edge-aware representation (ER) subspace is introduced to transform depth features and enhance local predictions around edges. The paper claims this yields state-of-the-art performance on standard datasets, with ablations confirming the effectiveness of the two lifting modules.

Significance. If the asserted direct correspondence property of the DGR subspace can be rigorously derived from frame theory and the empirical gains prove robust and reproducible, the work could introduce a principled geometric embedding technique that improves interpretability and edge accuracy in depth estimation networks beyond standard encoder-decoder designs.

major comments (2)
  1. [Abstract] The central claim that 'the image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values' is asserted without derivation. No equations or frame-theoretic properties (e.g., inner-product monotonicity with depth bins, invertibility of the redundant representation, or guarantees against artifacts) are supplied to show why the chosen linearly dependent frame vectors produce this direct correspondence rather than a generic embedding. This mapping is load-bearing for the lifting modules and for interpreting the ablation results as validation of the approach.
  2. [Abstract] The statements that 'our LiftFormer achieves state-of-the-art performance on widely used datasets' and that 'an ablation study validates the effectiveness of both proposed lifting modules' are presented without any quantitative metrics, error bars, dataset names, baseline comparisons, or statistical tests. The full manuscript must supply these (including tables of Abs Rel, RMSE, etc.) to allow assessment of whether the subspace constructions actually drive the reported gains.
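As context for the second major comment, the standard monocular-depth metrics it refers to are defined below; these are the field's conventional definitions, not values or notation taken from the manuscript.

```latex
% Standard evaluation metrics over N valid pixels, with prediction d_i and ground truth d_i^*.
\[
  \mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert d_i - d_i^{*}\rvert}{d_i^{*}},
  \qquad
  \mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{(d_i - d_i^{*})^{2}}{d_i^{*}},
  \qquad
  \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (d_i - d_i^{*})^{2}},
\]
\[
  \delta_t = \frac{1}{N}\,\Bigl\lvert\Bigl\{\, i : \max\!\Bigl(\tfrac{d_i}{d_i^{*}},\, \tfrac{d_i^{*}}{d_i}\Bigr) < 1.25^{\,t} \Bigr\}\Bigr\rvert,
  \qquad t \in \{1, 2, 3\}.
\]
```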
minor comments (2)
  1. [Abstract] The abstract refers to 'lifting theory topology' without a specific reference or brief explanation of the lifting scheme employed; a citation or short definition would improve accessibility.
  2. The free parameters (number of depth bins, subspace dimensionality) are introduced without discussion of sensitivity or selection criteria; a brief analysis or default values would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve theoretical rigor and clarity.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'the image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values' is asserted without derivation. No equations or frame-theoretic properties (e.g., inner-product monotonicity with depth bins, invertibility of the redundant representation, or guarantees against artifacts) are supplied to show why the chosen linearly dependent frame vectors produce this direct correspondence rather than a generic embedding. This mapping is load-bearing for the lifting modules and for interpreting the ablation results as validation of the approach.

    Authors: We acknowledge that the abstract presents the direct correspondence claim without an explicit derivation. The manuscript's Section 3.2 describes the DGR subspace construction via frame theory using linearly dependent vectors aligned with depth bins, but we agree a self-contained derivation of key properties (inner-product monotonicity, invertibility of the redundant frame, and artifact bounds) is not provided. In the revised version we will insert a dedicated subsection with the required equations and frame-theoretic arguments to establish why this yields direct depth correspondence rather than a generic embedding. revision: yes

  2. Referee: [Abstract] The statements that 'our LiftFormer achieves state-of-the-art performance on widely used datasets' and that 'an ablation study validates the effectiveness of both proposed lifting modules' are presented without any quantitative metrics, error bars, dataset names, baseline comparisons, or statistical tests. The full manuscript must supply these (including tables of Abs Rel, RMSE, etc.) to allow assessment of whether the subspace constructions actually drive the reported gains.

    Authors: The full manuscript already contains Table 1 with Abs Rel, Sq Rel, RMSE, and other metrics on KITTI and NYU Depth V2 together with baseline comparisons, and Table 2 with the ablation results for the two lifting modules. Error bars are omitted following common practice in the field, but we will add a reproducibility note. To directly address the abstract, we will insert concise quantitative highlights (key metrics, datasets, and main baselines) into the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; formulation is a modeling choice with external validation

full rationale

The paper formulates MDE as a transformation into DGR and ER subspaces constructed via frame theory with vectors aligned to depth bins, asserting that features 'correspond directly to the depth values' within this representation. This is presented as the core architectural proposal rather than a derived equality that reduces to inputs by construction. No equations are supplied that equate the claimed correspondence to a fitted parameter or prior self-result. SOTA performance and ablation results are reported on standard external datasets (KITTI and NYU Depth V2), providing independent empirical grounding. No load-bearing self-citations, uniqueness theorems, or renamings of known results are evident in the provided text that would force the central claims. The derivation chain remains self-contained as a proposed lifting-based architecture.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the unproven applicability of lifting theory topology to create a bridging subspace between color features and depth values, plus the assumption that frame theory with per-bin vectors yields a robust redundant representation; no independent evidence for these mappings is supplied in the abstract.
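As a one-line illustration of the redundancy assumption (with hypothetical numbers, since neither free parameter is quantified in the abstract): whenever the number of depth bins exceeds the feature dimensionality, the per-bin vectors are linearly dependent by construction.

```python
import numpy as np

# Hypothetical values: C feature dimensions, K depth bins, with K > C.
C, K = 64, 256
bin_vectors = np.random.randn(K, C)
# The rank is at most C, so at least K - C = 192 bin vectors are linear
# combinations of the others: the representation is redundant by construction.
print(np.linalg.matrix_rank(bin_vectors))  # prints 64 (almost surely)
```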

free parameters (2)
  • number of depth bins
    Used to define the linearly dependent vectors that construct the DGR subspace
  • subspace dimensionality
    Determines the size of the redundant representation but not quantified in the abstract
axioms (2)
  • domain assumption: Lifting theory topology can construct an intermediate subspace that directly bridges image color features and geometric depth values
    Invoked as the basis for the DGR construction in the abstract
  • domain assumption: Frame theory with linearly dependent vectors per depth bin provides a redundant and robust representation for depth prediction
    Stated as the mechanism that allows features to correspond directly to depth values
invented entities (2)
  • Depth-oriented geometric representation (DGR) subspace (no independent evidence)
    purpose: To transform image features into a space where they correspond directly to depth values
    Newly constructed using frame theory and depth bins
  • Edge-aware representation (ER) subspace (no independent evidence)
    purpose: To enhance depth features around edges where sharp changes occur
    Newly constructed to address common prediction errors at boundaries

pith-pipeline@v0.9.0 · 5578 in / 1730 out tokens · 79928 ms · 2026-05-10T19:05:25.679224+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 5 canonical work pages · 2 internal anchors

  [1] C. Ling, X. Zhang, and H. Chen, "Unsupervised monocular depth estimation using attention and multi-warp reconstruction," IEEE Transactions on Multimedia, vol. 24, pp. 2938–2949, 2021.
  [2] X. Yang, Y. Gao, H. Luo, C. Liao, and K.-T. Cheng, "Bayesian denet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty," IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2701–2713, 2019.
  [3] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3828–3838.
  [4] J. Wu, R. Ji, Q. Wang, S. Zhang, X. Sun, Y. Wang, M. Xu, and F. Huang, "Fast monocular depth estimation via side prediction aggregation with continuous spatial refinement," IEEE Transactions on Multimedia, vol. 25, pp. 1204–1216, 2023.
  [5] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12179–12188.
  [6] X. Wang, W. Kong, Q. Zhang, Y. Yang, T. Zhao, and J. Jiang, "Distortion-aware self-supervised indoor 360° depth estimation via hybrid projection fusion and structural regularities," IEEE Transactions on Multimedia, vol. 26, pp. 3998–4011, 2024.
  [7] S. Shao, R. Li, Z. Pei, Z. Liu, W. Chen, W. Zhu, X. Wu, and B. Zhang, "Towards comprehensive monocular depth estimation: Multiple heads are better than one," IEEE Transactions on Multimedia, vol. 25, pp. 7660–7671, 2023.
  [8] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, "3D packing for self-supervised monocular depth estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  [9] J. Lei, B. Peng, C. Zhang, X. Mei, X. Cao, X. Fan, and X. Li, "Shape-preserving object depth control for stereoscopic images," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 12, pp. 3333–3344, 2018.
  [10] S. Guo, J. Hu, K. Zhou, J. Wang, L. Song, R. Xie, and W. Zhang, "Real-time free viewpoint video synthesis system based on dibr and a depth estimation network," IEEE Transactions on Multimedia, pp. 1–16, 2024.
  [11] J. Lei, T. Guo, B. Peng, and C. Yu, "Depth-assisted joint detection network for monocular 3d object detection," in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 2204–2208.
  [12] D. Wang, Y. Xu, H. Zhu, and K. Liu, "A novel framework for pothole area estimation based on object detection and monocular metric depth estimation," in 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2024, pp. 1–6.
  [13] S. Ullman, "The interpretation of structure from motion," Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 203, no. 1153, pp. 405–426, 1979.
  [14] T. Nagai, T. Naruse, M. Ikehara, and A. Kurematsu, "HMM-based surface reconstruction from single images," in Proceedings. International Conference on Image Processing, vol. 2. IEEE, 2002, pp. II–II.
  [15] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, vol. 27, 2014.
  [16] W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, "Neural window fully-connected CRFs for monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3916–3925.
  [17] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, "Bevdepth: Acquisition of reliable depth for multi-view 3d object detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1477–1485.
  [18] C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, "Single image depth prediction made better: A multivariate gaussian take," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17346–17356.
  [19] Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, "Panoformer: Panorama transformer for indoor 360° depth estimation," in European Conference on Computer Vision, 2022.
  [20] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
  [21] S. F. Bhat, I. Alhashim, and P. Wonka, "Adabins: Depth estimation using adaptive bins," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018.
  [22] Z. Li, X. Wang, X. Liu, and J. Jiang, "Binsformer: Revisiting adaptive bins for monocular depth estimation," arXiv preprint arXiv:2204.00987, 2022.
  [23] A. Agarwal and C. Arora, "Attention attention everywhere: Monocular depth prediction with skip attention," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5861–5870.
  [24] Z. Li, Z. Chen, X. Liu, and J. Jiang, "Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation," Machine Intelligence Research, pp. 1–18, 2023.
  [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
  [26] S. d'Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, "Convit: Improving vision transformers with soft convolutional inductive biases," in International Conference on Machine Learning. PMLR, 2021, pp. 2286–2296.
  [27] O. N. Manzari, H. Kashiani, H. A. Dehkordi, and S. B. Shokouhi, "Robust transformer with locality inductive bias and feature normalization," Engineering Science and Technology, an International Journal, vol. 38, p. 101320, 2023.
  [28] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, "Cvt: Introducing convolutions to vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 22–31.
  [29] J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, "Cmt: Convolutional neural networks meet vision transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12175–12185.
  [30] X. Yang, S. Zhang, and B. Zhao, "Self-supervised monocular depth estimation with multi-constraints," in 2021 40th Chinese Control Conference (CCC), 2021, pp. 8422–8427.
  [31] X. Chen, X. Chen, and Z.-J. Zha, "Structure-aware residual pyramid network for monocular depth estimation," in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19). AAAI Press, 2019, pp. 694–700.
  [32] J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, "From big to small: Multi-scale local planar guidance for monocular depth estimation," arXiv preprint arXiv:1907.10326, 2019.
  [33] S. Shao, Z. Pei, W. Chen, R. Li, Z. Liu, and Z. Li, "Urcdc-depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation," IEEE Transactions on Multimedia, vol. 26, pp. 3341–3353, 2024.
  [34] C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, "Va-depthnet: A variational approach to single image depth prediction," in International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
  [35] X. Song, H. Hu, L. Liang, W. Shi, G. Xie, X. Lu, and X. Hei, "Unsupervised monocular estimation of depth and visual odometry using attention and depth-pose consistency loss," IEEE Transactions on Multimedia, vol. 26, pp. 3517–3529, 2024.
  [36] L. Piccinelli, C. Sakaridis, and F. Yu, "idisc: Internal discretization for monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21477–21487.
  [37] R. Li, D. Xue, Y. Zhu, H. Wu, J. Sun, and Y. Zhang, "Self-supervised monocular depth estimation with frequency-based recurrent refinement," IEEE Transactions on Multimedia, vol. 25, pp. 5626–5637, 2023.
  [38] X. Chen, X. Chen, Y. Zhang, X. Fu, and Z.-J. Zha, "Laplacian pyramid neural network for dense continuous-value regression for complex scenes," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 11, pp. 5034–5046, 2021.
  [39] G. Xiong, J. Qi, Y. Peng, Y. Ping, and C. Wu, "Rddepth: A lightweight algorithm for monocular depth estimation," in 2024 4th International Conference on Computer, Control and Robotics (ICCCR), 2024, pp. 26–30.
  [40] F. Liu, C. Shen, and G. Lin, "Deep convolutional neural fields for depth estimation from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5162–5170.
  [41] D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, "Fastdepth: Fast monocular depth estimation on embedded systems," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108.
  [42] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, "Learning depth from monocular videos using direct methods," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2022–2030.
  [43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (ICLR), Austria, May 3–7, 2021.
  [44] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  [46] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, "Simmim: A simple framework for masked image modeling," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9653–9663.
  [47] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, "Geonet: Geometric neural network for joint depth and surface normal estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 283–291.
  [48] Y. Ji, Z. Chen, E. Xie, L. Hong, X. Liu, Z. Liu, T. Lu, Z. Li, and P. Luo, "Ddp: Diffusion model for dense visual prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21741–21752.
  [49] W. Peebles and S. Xie, "Scalable diffusion models with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205.
  [50] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, "Depth anything: Unleashing the power of large-scale unlabeled data," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10371–10381.
  [51] H. Jung, E. Park, and S. Yoo, "Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12642–12652.
  [52] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, "Joint task-recursive learning for semantic segmentation and depth estimation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 235–251.
  [53] S. F. Bhat, I. Alhashim, and P. Wonka, "Localbins: Improving depth estimation by learning local distributions," in European Conference on Computer Vision. Springer, 2022, pp. 480–496.
  [54] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, "Zoedepth: Zero-shot transfer by combining relative and metric depth," arXiv preprint arXiv:2302.12288, 2023.
  [55] A. Saxena, S. Chung, and A. Ng, "Learning depth from single monocular images," Advances in Neural Information Processing Systems, vol. 18, 2005.
  [56] G. C. Gini, A. Marchi et al., "Indoor robot navigation with single camera vision," PRIS, vol. 2, pp. 67–76, 2002.
  [57] M. Shao, T. Simchony, and R. Chellappa, "New algorithms from reconstruction of a 3-d depth map from one or more images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1988, pp. 530–531.
  [58] C.-H. Chen, H. Zhou, and T. Ahonen, "Blur-aware disparity estimation from defocus stereo images," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 855–863.
  [59] J. Kovačević, A. Chebira et al., "An introduction to frames," Foundations and Trends® in Signal Processing, vol. 2, no. 1, pp. 1–94, 2008.
  [60] O. Christensen et al., An introduction to frames and Riesz bases. Springer, 2003, vol. 7.
  [61] M. C. Cieslak, A. M. Castelfranco, V. Roncalli, P. H. Lenz, and D. K. Hartline, "t-distributed stochastic neighbor embedding (t-sne): A tool for eco-physiological transcriptomic analysis," Marine Genomics, vol. 51, p. 100723, 2020.
  [62] Z. Yu, C. Feng, M.-Y. Liu, and S. Ramalingam, "Casenet: Deep category-aware semantic edge detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5964–5973.
  [63] L. Talker, A. Cohen, E. Yosef, A. Dana, and M. Dinerstein, "Mind the edge: Refining depth edges in sparsely-supervised monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 10606–10616.
  [64] M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei, "Rethinking bisenet for real-time semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9716–9725.
  [65] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 391–405.
  [66] T. Naderi, A. Sadovnik, J. Hayward, and H. Qi, "Monocular depth estimation with adaptive geometric attention," in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 617–627.
  [67] G. Manimaran and J. Swaminathan, "Focal-wnet: An architecture unifying convolution and attention for depth estimation," in 2022 IEEE 7th International Conference for Convergence in Technology (I2CT). IEEE, 2022, pp. 1–7.
  [68] V. Patil, C. Sakaridis, A. Liniger, and L. Van Gool, "P3depth: Monocular depth estimation with a piecewise planarity prior," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1610–1621.
  [69] J. Wang, G. Zhang, Z. Wu, X. Li, and L. Liu, "Self-supervised joint learning framework of depth estimation via implicit cues," arXiv preprint arXiv:2006.09876, 2020.
  [70] M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, "Dinov2: Learning robust visual features without supervision."
  [71] [Online]. Available: https://api.semanticscholar.org/CorpusID:258170077
  [72] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
  [73] G. Bae, I. Budvytis, and R. Cipolla, "Irondepth: Iterative refinement of single-view depth using surface normal and its uncertainty," in British Machine Vision Conference (BMVC), 2022.
  [74] C.-Y. Wu, Y. Zhong, J. Wang, and U. Neumann, "Meta-optimization for higher model generalizability in single-image depth prediction," in International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
  [75] S. Zhang, L. Yang, M. B. Mi, X. Zheng, and A. Yao, "Improving deep regression with ordinal entropy," in International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
  [76] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  [77] R. Garg, V. K. Bg, G. Carneiro, and I. Reid, "Unsupervised cnn for single view depth estimation: Geometry to the rescue," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII 14. Springer, 2016, pp. 740–756.
  [78] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from rgbd images," in ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760.
  [79] S. Lee, J. Lee, B. Kim, E. Yi, and J. Kim, "Patch-wise attention network for monocular depth estimation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 1873–1881.