Recognition: 2 theorem links · Lean Theorem
LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation
Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3
The pith
LiftFormer maps image features into depth-binned geometric subspaces to turn monocular depth estimation into direct representation matching.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A DGR subspace is built from linearly dependent vectors tied to depth bins to give a redundant representation; image features are mapped into it so they correspond directly to depth values. An ER subspace is constructed in parallel so that depth features can be used to boost local accuracy around edges.
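To make the mechanism concrete, here is a minimal sketch of the general pattern as we read it, not the paper's actual construction: an overcomplete set of unit vectors (more vectors than feature dimensions, hence necessarily linearly dependent) plays the role of the frame, each vector is tied to a depth-bin center, and depth is read off as a soft assignment over bins. The names `frame`, `bin_centers`, and `lift_and_read_depth` are illustrative assumptions, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_bins = 64, 128            # num_bins > feat_dim => redundant, linearly dependent
frame = rng.standard_normal((feat_dim, num_bins))
frame /= np.linalg.norm(frame, axis=0)  # one unit-norm frame vector per depth bin

# Depth value associated with each bin (a KITTI-like 0.5-80 m range, our choice).
bin_centers = np.linspace(0.5, 80.0, num_bins)

def lift_and_read_depth(feature):
    """Correlate a feature with every frame vector, then read depth as a
    softmax-weighted combination of the corresponding bin centers."""
    scores = frame.T @ feature              # one score per depth bin
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # soft assignment over bins
    return float(weights @ bin_centers)     # expected depth

print(lift_and_read_depth(rng.standard_normal(feat_dim)))
```

In this reading, the redundancy claim is simply `num_bins > feat_dim`; whether such a projection yields the direct feature-to-depth correspondence the paper asserts is exactly what the referee report below asks the authors to derive.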
What carries the argument
The two lifting modules that embed features into the DGR subspace (frame-theory depth bins) and the ER subspace (edge sharpening) to convert color-to-depth learning into subspace representation.
If this is right
- Depth prediction reduces to learning and matching representations inside a redundant, depth-binned space.
- Edge artifacts drop because the ER subspace supplies targeted local enhancement (see the sketch after this list).
- The redundant frame-theory basis increases robustness to small variations in input features.
- Performance scales with standard monocular depth benchmarks without extra post-processing.
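On the edge point, a hedged sketch of what targeted local enhancement could look like: a generic gradient-gated residual correction, standing in for the paper's ER module, which the abstract does not specify. `edge_features` and `alpha` are hypothetical stand-ins we introduce for illustration.

```python
import numpy as np
from scipy import ndimage  # assumed available; any gradient filter would do

def edge_gated_refine(depth_pred, edge_features, alpha=0.5):
    """Apply a correction only near depth discontinuities.

    `edge_features` stands in for whatever correction the ER branch produces;
    the gate concentrates it where the predicted depth changes sharply.
    """
    gx = ndimage.sobel(depth_pred, axis=0)
    gy = ndimage.sobel(depth_pred, axis=1)
    edge_mag = np.hypot(gx, gy)                # gradient magnitude
    gate = edge_mag / (edge_mag.max() + 1e-8)  # in [0, 1], near 1 at edges
    return depth_pred + alpha * gate * edge_features
```

The design choice this illustrates is that the correction is multiplicatively gated, so smooth regions are left untouched and only boundary pixels are revised.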
Where Pith is reading between the lines
- The same lifting pattern could be tested on related dense tasks such as surface normal estimation or instance depth.
- Replacing the backbone with other modern vision transformers would isolate whether the subspace modules alone drive the reported gains.
- Theoretical bounds on representation stability could be derived from the frame-theory construction to predict failure modes on new scene types.
Load-bearing premise
That features placed in the DGR subspace will map to correct depth values without losing global consistency or creating new boundary errors.
What would settle it
An experiment in which removing the DGR or ER lifting module leaves or improves accuracy on the same test sets and metrics.
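Such an ablation is straightforward to score. Below is a minimal sketch of the two standard metrics it would be judged on; the variant names and the `predictions` mapping are hypothetical, not the paper's code.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error (assumes gt > 0); lower is better."""
    return float(np.mean(np.abs(pred - gt) / gt))

def rmse(pred, gt):
    """Root mean squared error in depth units; lower is better."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def score_ablation(predictions, gt):
    """Score, e.g., {'full': ..., 'no_dgr': ..., 'no_er': ...} against one ground truth."""
    return {name: {"abs_rel": abs_rel(p, gt), "rmse": rmse(p, gt)}
            for name, p in predictions.items()}
```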
Original abstract
Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LiftFormer for monocular depth estimation by reformulating depth prediction as a mapping of image spatial features into a depth-oriented geometric representation (DGR) subspace constructed via frame theory from linearly dependent vectors aligned with depth bins, thereby bridging color features to geometric depth values. An additional edge-aware representation (ER) subspace is introduced to transform depth features and enhance local predictions around edges. The paper claims this yields state-of-the-art performance on standard datasets, with ablations confirming the effectiveness of the two lifting modules.
Significance. If the asserted direct correspondence property of the DGR subspace can be rigorously derived from frame theory and the empirical gains prove robust and reproducible, the work could introduce a principled geometric embedding technique that improves interpretability and edge accuracy in depth estimation networks beyond standard encoder-decoder designs.
major comments (2)
- [Abstract] The central claim that 'the image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values' is asserted without derivation. No equations or frame-theoretic properties (e.g., inner-product monotonicity with depth bins, invertibility of the redundant representation, or guarantees against artifacts) are supplied to show why the chosen linearly dependent frame vectors produce this direct correspondence rather than a generic embedding; the standard frame condition at issue is recalled after these comments. This mapping is load-bearing for the lifting modules and for interpreting the ablation results as validation of the approach.
- [Abstract] The statements that 'our LiftFormer achieves state-of-the-art performance on widely used datasets' and that 'an ablation study validates the effectiveness of both proposed lifting modules' are presented without any quantitative metrics, error bars, dataset names, baseline comparisons, or statistical tests. The full manuscript must supply these (including tables of Abs Rel, RMSE, etc.) to allow assessment of whether the subspace constructions actually drive the reported gains.
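For reference, the frame-theoretic machinery the first comment invokes is standard (see the frame-theory texts [59] and [60] in the reference graph below). A frame guarantees stable, redundant, and invertible analysis of a feature vector; it does not by itself guarantee any monotone relation between frame coefficients and depth, which is the gap the comment identifies. In the usual notation:

```latex
% A family {f_k} in a Hilbert space H is a frame if there exist bounds
% 0 < A <= B such that, for every x in H,
\[
  A\,\lVert x \rVert^{2}
  \;\le\; \sum_{k} \bigl|\langle x, f_k \rangle\bigr|^{2}
  \;\le\; B\,\lVert x \rVert^{2},
\]
% in which case x is exactly recovered from its redundant coefficients
% via the canonical dual frame:
\[
  x \;=\; \sum_{k} \langle x, f_k \rangle\, \tilde{f}_k .
\]
```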
minor comments (2)
- [Abstract] The abstract refers to 'lifting theory topology' without a specific reference or brief explanation of the lifting scheme employed; a citation or short definition would improve accessibility.
- The free parameters (number of depth bins, subspace dimensionality) are introduced without discussion of sensitivity or selection criteria; a brief analysis or default values would strengthen reproducibility.
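On the second minor comment, the requested sensitivity analysis amounts to a small grid sweep. In the sketch below, `train_and_eval` is a hypothetical stand-in for the paper's training and evaluation pipeline, and the candidate values are our assumptions, not defaults reported by the authors.

```python
def sweep(train_and_eval, bin_counts=(64, 128, 256), subspace_dims=(32, 64, 128)):
    """Return one metric (e.g., Abs Rel) per (num_bins, subspace_dim) pair."""
    return {(b, d): train_and_eval(num_bins=b, subspace_dim=d)
            for b in bin_counts for d in subspace_dims}
```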
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve theoretical rigor and clarity.
Point-by-point responses
-
Referee: [Abstract] The central claim that 'the image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values' is asserted without derivation. No equations or frame-theoretic properties (e.g., inner-product monotonicity with depth bins, invertibility of the redundant representation, or guarantees against artifacts) are supplied to show why the chosen linearly dependent frame vectors produce this direct correspondence rather than a generic embedding. This mapping is load-bearing for the lifting modules and for interpreting the ablation results as validation of the approach.
Authors: We acknowledge that the abstract presents the direct correspondence claim without an explicit derivation. The manuscript's Section 3.2 describes the DGR subspace construction via frame theory using linearly dependent vectors aligned with depth bins, but we agree a self-contained derivation of key properties (inner-product monotonicity, invertibility of the redundant frame, and artifact bounds) is not provided. In the revised version we will insert a dedicated subsection with the required equations and frame-theoretic arguments to establish why this yields direct depth correspondence rather than a generic embedding. revision: yes
-
Referee: [Abstract] The statements that 'our LiftFormer achieves state-of-the-art performance on widely used datasets' and that 'an ablation study validates the effectiveness of both proposed lifting modules' are presented without any quantitative metrics, error bars, dataset names, baseline comparisons, or statistical tests. The full manuscript must supply these (including tables of Abs Rel, RMSE, etc.) to allow assessment of whether the subspace constructions actually drive the reported gains.
Authors: The full manuscript already contains Table 1 with Abs Rel, Sq Rel, RMSE, and other metrics on KITTI and NYU Depth V2 together with baseline comparisons, and Table 2 with the ablation results for the two lifting modules. Error bars are omitted following common practice in the field, but we will add a reproducibility note. To directly address the abstract, we will insert concise quantitative highlights (key metrics, datasets, and main baselines) into the revised abstract. revision: yes
Circularity Check
No significant circularity detected; formulation is a modeling choice with external validation
Full rationale
The paper formulates MDE as a transformation into DGR and ER subspaces constructed via frame theory with vectors aligned to depth bins, asserting that features 'correspond directly to the depth values' within this representation. This is presented as the core architectural proposal rather than a derived equality that reduces to inputs by construction. No equations are supplied that equate the claimed correspondence to a fitted parameter or prior self-result. SOTA performance and ablation results are reported on standard external datasets (e.g., widely used benchmarks), providing independent empirical grounding. No load-bearing self-citations, uniqueness theorems, or renamings of known results are evident in the provided text that would force the central claims. The derivation chain remains self-contained as a proposed lifting-based architecture.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of depth bins
- subspace dimensionality
axioms (2)
- domain assumption: Lifting theory topology can construct an intermediate subspace that directly bridges image color features and geometric depth values
- domain assumption: Frame theory with linearly dependent vectors per depth bin provides a redundant and robust representation for depth prediction
invented entities (2)
- Depth-oriented geometric representation (DGR) subspace: no independent evidence
- Edge-aware representation (ER) subspace: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Unclear: the relation between the paper passage and the cited Recognition theorem is not established.
Passage: “A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins... The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values.”
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Unclear: the relation between the paper passage and the cited Recognition theorem is not established.
Passage: “lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values”
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Unsupervised monocular depth estimation using attention and multi-warp reconstruction,
C. Ling, X. Zhang, and H. Chen, “Unsupervised monocular depth estimation using attention and multi-warp reconstruction,” IEEE Transactions on Multimedia, vol. 24, pp. 2938–2949, 2021
2021
-
[2]
Bayesian denet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty,
X. Yang, Y. Gao, H. Luo, C. Liao, and K.-T. Cheng, “Bayesian denet: Monocular depth prediction and frame-wise fusion with synchronized uncertainty,” IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2701–2713, 2019
2019
-
[3]
Digging into self-supervised monocular depth estimation,
C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 3828–3838
2019
-
[4]
Fast monocular depth estimation via side prediction aggregation with continuous spatial refinement,
J. Wu, R. Ji, Q. Wang, S. Zhang, X. Sun, Y. Wang, M. Xu, and F. Huang, “Fast monocular depth estimation via side prediction aggregation with continuous spatial refinement,” IEEE Transactions on Multimedia, vol. 25, pp. 1204–1216, 2023
2023
-
[5]
Vision transformers for dense prediction,
R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12179–12188
2021
-
[6]
Distortion-aware self-supervised indoor 360° depth estimation via hybrid projection fusion and structural regularities,
X. Wang, W. Kong, Q. Zhang, Y. Yang, T. Zhao, and J. Jiang, “Distortion-aware self-supervised indoor 360° depth estimation via hybrid projection fusion and structural regularities,” IEEE Transactions on Multimedia, vol. 26, pp. 3998–4011, 2024
2024
-
[7]
Towards comprehensive monocular depth estimation: Multiple heads are better than one,
S. Shao, R. Li, Z. Pei, Z. Liu, W. Chen, W. Zhu, X. Wu, and B. Zhang, “Towards comprehensive monocular depth estimation: Multiple heads are better than one,” IEEE Transactions on Multimedia, vol. 25, pp. 7660–7671, 2023
2023
-
[8]
3d packing for self-supervised monocular depth estimation,
V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020
2020
-
[9]
Shape-preserving object depth control for stereoscopic images,
J. Lei, B. Peng, C. Zhang, X. Mei, X. Cao, X. Fan, and X. Li, “Shape-preserving object depth control for stereoscopic images,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 12, pp. 3333–3344, 2018
2018
-
[10]
Real-time free viewpoint video synthesis system based on dibr and a depth estimation network,
S. Guo, J. Hu, K. Zhou, J. Wang, L. Song, R. Xie, and W. Zhang, “Real-time free viewpoint video synthesis system based on dibr and a depth estimation network,” IEEE Transactions on Multimedia, pp. 1–16, 2024
2024
-
[11]
Depth-assisted joint detection network for monocular 3d object detection,
J. Lei, T. Guo, B. Peng, and C. Yu, “Depth-assisted joint detection network for monocular 3d object detection,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 2204–2208
2021
-
[12]
A novel framework for pothole area estimation based on object detection and monocular metric depth estimation,
D. Wang, Y. Xu, H. Zhu, and K. Liu, “A novel framework for pothole area estimation based on object detection and monocular metric depth estimation,” in 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), 2024, pp. 1–6
2024
-
[13]
The interpretation of structure from motion,
S. Ullman, “The interpretation of structure from motion,” Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 203, no. 1153, pp. 405–426, 1979
1979
-
[14]
Hmm-based surface reconstruction from single images,
T. Nagai, T. Naruse, M. Ikehara, and A. Kurematsu, “Hmm-based surface reconstruction from single images,” in Proceedings. International Conference on Image Processing, vol. 2. IEEE, 2002, pp. II–II
2002
-
[15]
Depth map prediction from a single image using a multi-scale deep network,
D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems, vol. 27, 2014
2014
-
[16]
Neural window fully-connected crfs for monocular depth estimation,
W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “Neural window fully-connected crfs for monocular depth estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3916–3925
2022
-
[17]
Bevdepth: Acquisition of reliable depth for multi-view 3d object detection,
Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, “Bevdepth: Acquisition of reliable depth for multi-view 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1477–1485
2023
-
[18]
Single image depth prediction made better: A multivariate gaussian take,
C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, “Single image depth prediction made better: A multivariate gaussian take,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17346–17356
2023
-
[19]
Panoformer: Panorama transformer for indoor 360° depth estimation,
Z. Shen, C. Lin, K. Liao, L. Nie, Z. Zheng, and Y. Zhao, “Panoformer: Panorama transformer for indoor 360° depth estimation,” in European Conference on Computer Vision, 2022
2022
-
[20]
Deep ordinal regression network for monocular depth estimation,
H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2002–2011
2018
-
[21]
Adabins: Depth estimation using adaptive bins,
S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018
2021
-
[22]
Binsformer: Revisiting adaptive bins for monocular depth estimation
Z. Li, X. Wang, X. Liu, and J. Jiang, “Binsformer: Revisiting adaptive bins for monocular depth estimation,” arXiv preprint arXiv:2204.00987, 2022
-
[23]
Attention attention everywhere: Monocular depth prediction with skip attention,
A. Agarwal and C. Arora, “Attention attention everywhere: Monocular depth prediction with skip attention,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5861–5870
2023
-
[24]
Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation,
Z. Li, Z. Chen, X. Liu, and J. Jiang, “Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation,” Machine Intelligence Research, pp. 1–18, 2023
2023
-
[25]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017
2017
-
[26]
Convit: Improving vision transformers with soft convolutional inductive biases,
S. d’Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli, and L. Sagun, “Convit: Improving vision transformers with soft convolutional inductive biases,” in International conference on machine learning. PMLR, 2021, pp. 2286–2296
2021
-
[27]
Robust transformer with locality inductive bias and feature normalization,
O. N. Manzari, H. Kashiani, H. A. Dehkordi, and S. B. Shokouhi, “Robust transformer with locality inductive bias and feature normalization,” Engineering Science and Technology, an International Journal, vol. 38, p. 101320, 2023
2023
-
[28]
Cvt: Introducing convolutions to vision transformers,
H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “Cvt: Introducing convolutions to vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 22–31
2021
-
[29]
Cmt: Convolutional neural networks meet vision transformers,
J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, “Cmt: Convolutional neural networks meet vision transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12175–12185
2022
-
[30]
Self-supervised monocular depth estimation with multi-constraints,
X. Yang, S. Zhang, and B. Zhao, “Self-supervised monocular depth estimation with multi-constraints,” in 2021 40th Chinese Control Conference (CCC), 2021, pp. 8422–8427
2021
-
[31]
Structure-aware residual pyramid network for monocular depth estimation,
X. Chen, X. Chen, and Z.-J. Zha, “Structure-aware residual pyramid network for monocular depth estimation,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence, ser. IJCAI’19. AAAI Press, 2019, pp. 694–700
2019
-
[32]
J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi-scale local planar guidance for monocular depth estimation,” arXiv preprint arXiv:1907.10326, 2019
-
[33]
Urcdc-depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation,
S. Shao, Z. Pei, W. Chen, R. Li, Z. Liu, and Z. Li, “Urcdc-depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation,” IEEE Transactions on Multimedia, vol. 26, pp. 3341–3353, 2024
2024
-
[34]
Va-depthnet: A variational approach to single image depth prediction,
C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, “Va-depthnet: A variational approach to single image depth prediction,” International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1-5, 2023
2023
-
[35]
Unsupervised monocular estimation of depth and visual odometry using attention and depth-pose consistency loss,
X. Song, H. Hu, L. Liang, W. Shi, G. Xie, X. Lu, and X. Hei, “Unsupervised monocular estimation of depth and visual odometry using attention and depth-pose consistency loss,” IEEE Transactions on Multimedia, vol. 26, pp. 3517–3529, 2024
2024
-
[36]
idisc: Internal discretization for monocular depth estimation,
L. Piccinelli, C. Sakaridis, and F. Yu, “idisc: Internal discretization for monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21477–21487
2023
-
[37]
Self-supervised monocular depth estimation with frequency-based recurrent refinement,
R. Li, D. Xue, Y. Zhu, H. Wu, J. Sun, and Y. Zhang, “Self-supervised monocular depth estimation with frequency-based recurrent refinement,” IEEE Transactions on Multimedia, vol. 25, pp. 5626–5637, 2023
2023
-
[38]
Laplacian pyramid neural network for dense continuous-value regression for complex scenes,
X. Chen, X. Chen, Y. Zhang, X. Fu, and Z.-J. Zha, “Laplacian pyramid neural network for dense continuous-value regression for complex scenes,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 11, pp. 5034–5046, 2021
2021
-
[39]
Rddepth: A lightweight algorithm for monocular depth estimation,
G. Xiong, J. Qi, Y. Peng, Y. Ping, and C. Wu, “Rddepth: A lightweight algorithm for monocular depth estimation,” in 2024 4th International Conference on Computer, Control and Robotics (ICCCR), 2024, pp. 26–30
2024
-
[40]
Deep convolutional neural fields for depth estimation from a single image,
F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5162–5170
2015
-
[41]
Fastdepth: Fast monocular depth estimation on embedded systems,
D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, “Fastdepth: Fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108
2019
-
[42]
Learning depth from monocular videos using direct methods,
C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2022–2030
2018
-
[43]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” International Conference on Learning Representations (ICLR), Austria, May 3-7, 2021
2021
-
[44]
Swin transformer: Hierarchical vision transformer using shifted windows,
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022
2021
-
[45]
Imagenet: A large-scale hierarchical image database,
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255
2009
-
[46]
Simmim: A simple framework for masked image modeling,
Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9653–9663
2022
-
[47]
Geonet: Geometric neural network for joint depth and surface normal estimation,
X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 283–291
2018
-
[48]
Ddp: Diffusion model for dense visual prediction,
Y. Ji, Z. Chen, E. Xie, L. Hong, X. Liu, Z. Liu, T. Lu, Z. Li, and P. Luo, “Ddp: Diffusion model for dense visual prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21741–21752
2023
-
[49]
Scalable diffusion models with transformers,
W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205
2023
-
[50]
Depth anything: Unleashing the power of large-scale unlabeled data,
L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10371–10381
2024
-
[51]
Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation,
H. Jung, E. Park, and S. Yoo, “Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12642–12652
2021
-
[52]
Joint task-recursive learning for semantic segmentation and depth estimation,
Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, “Joint task-recursive learning for semantic segmentation and depth estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 235–251
2018
-
[53]
Localbins: Improving depth estimation by learning local distributions,
S. F. Bhat, I. Alhashim, and P. Wonka, “Localbins: Improving depth estimation by learning local distributions,” in European Conference on Computer Vision. Springer, 2022, pp. 480–496
2022
-
[54]
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, 2023
arXiv 2023
-
[55]
Learning depth from single monocular images,
A. Saxena, S. Chung, and A. Ng, “Learning depth from single monocular images,” Advances in neural information processing systems, vol. 18, 2005
2005
-
[56]
Indoor robot navigation with single camera vision
G. C. Gini, A. Marchi et al., “Indoor robot navigation with single camera vision,” PRIS, vol. 2, pp. 67–76, 2002
2002
-
[57]
New algorithms from reconstruction of a 3-d depth map from one or more images,
M. Shao, T. Simchony, and R. Chellappa, “New algorithms from reconstruction of a 3-d depth map from one or more images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1988, pp. 530–531
1988
-
[58]
Blur-aware disparity estimation from defocus stereo images,
C.-H. Chen, H. Zhou, and T. Ahonen, “Blur-aware disparity estimation from defocus stereo images,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 855–863
2015
-
[59]
An introduction to frames,
J. Kovačević, A. Chebira et al., “An introduction to frames,” Foundations and Trends® in Signal Processing, vol. 2, no. 1, pp. 1–94, 2008
2008
-
[60]
An introduction to frames and Riesz bases
O. Christensen et al., An introduction to frames and Riesz bases. Springer, 2003, vol. 7
2003
-
[61]
t-distributed stochastic neighbor embedding (t-sne): A tool for eco-physiological transcriptomic analysis,
M. C. Cieslak, A. M. Castelfranco, V. Roncalli, P. H. Lenz, and D. K. Hartline, “t-distributed stochastic neighbor embedding (t-sne): A tool for eco-physiological transcriptomic analysis,” Marine genomics, vol. 51, p. 100723, 2020
2020
-
[62]
Casenet: Deep category-aware semantic edge detection,
Z. Yu, C. Feng, M.-Y. Liu, and S. Ramalingam, “Casenet: Deep category-aware semantic edge detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5964–5973
2017
-
[63]
Mind the edge: Refining depth edges in sparsely-supervised monocular depth estimation,
L. Talker, A. Cohen, E. Yosef, A. Dana, and M. Dinerstein, “Mind the edge: Refining depth edges in sparsely-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 10606–10616
2024
-
[64]
Rethinking bisenet for real-time semantic segmentation,
M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei, “Rethinking bisenet for real-time semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9716–9725
2021
-
[65]
Edge boxes: Locating object proposals from edges,
C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 391–405
2014
-
[66]
Monocular depth estimation with adaptive geometric attention,
T. Naderi, A. Sadovnik, J. Hayward, and H. Qi, “Monocular depth estimation with adaptive geometric attention,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 617–627
2022
-
[67]
Focal-wnet: An architecture unifying convolution and attention for depth estimation,
G. Manimaran and J. Swaminathan, “Focal-wnet: An architecture unifying convolution and attention for depth estimation,” in 2022 IEEE 7th International conference for Convergence in Technology (I2CT). IEEE, 2022, pp. 1–7
2022
-
[68]
P3depth: Monocular depth estimation with a piecewise planarity prior,
V. Patil, C. Sakaridis, A. Liniger, and L. Van Gool, “P3depth: Monocular depth estimation with a piecewise planarity prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1610–1621
2022
-
[69]
Self-supervised joint learning framework of depth estimation via implicit cues,
J. Wang, G. Zhang, Z. Wu, X. Li, and L. Liu, “Self-supervised joint learning framework of depth estimation via implicit cues,” arXiv preprint arXiv:2006.09876, 2020
-
[70]
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, T. Darcet, T. Moutakanni, H. Q. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. B. Huang, S.-W. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features w...
arXiv 2023
-
[71]
[Online]. Available: https://api.semanticscholar.org/CorpusID:258170077
-
[72]
Spatial pyramid pooling in deep convolutional networks for visual recognition,
K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015
2015
-
[73]
Irondepth: Iterative refinement of single-view depth using surface normal and its uncertainty,
G. Bae, I. Budvytis, and R. Cipolla, “Irondepth: Iterative refinement of single-view depth using surface normal and its uncertainty,” in British Machine Vision Conference (BMVC), 2022
2022
-
[74]
Meta-optimization for higher model generalizability in single-image depth prediction,
C.-Y. Wu, Y. Zhong, J. Wang, and U. Neumann, “Meta-optimization for higher model generalizability in single-image depth prediction,” International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1-5, 2023
2023
-
[75]
Improving deep regression with ordinal entropy,
S. Zhang, L. Yang, M. B. Mi, X. Zheng, and A. Yao, “Improving deep regression with ordinal entropy,” International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1-5, 2023
2023
-
[76]
Vision meets robotics: The kitti dataset,
A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013
2013
-
[77]
Unsupervised cnn for single view depth estimation: Geometry to the rescue,
R. Garg, V. K. Bg, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14. Springer, 2016, pp. 740–756
2016
-
[78]
Indoor segmentation and support inference from rgbd images,
N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760
2012
-
[79]
Patch-wise attention network for monocular depth estimation,
S. Lee, J. Lee, B. Kim, E. Yi, and J. Kim, “Patch-wise attention network for monocular depth estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 1873–1881
2021