SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

Changxuan Li; Federico Tombari; Nadine Berner; Nassir Navab; Stefano Gasperini

arxiv: 2605.28477 · v1 · pith:MEUQ7UEMnew · submitted 2026-05-27 · 💻 cs.CV

Pith reviewed 2026-06-29 12:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords: self-supervised monocular depth estimation · pose refinement · scale alignment · feature reprojection · KITTI · NYUv2 · Cityscapes

The pith

Reprojecting depth-estimated features refines pose and aligns scene scales in self-supervised monocular depth estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-supervised monocular depth estimation jointly trains a depth network and a pose network from image sequences, yet each estimates its own scene scale, often producing inconsistent results across frames or sequences. The paper establishes that feeding the current depth estimate into a reprojection of learnable visual features between consecutive frames, then minimizing the resulting alignment residuals, supplies a training signal that corrects the pose network. Once the scales match, depth predictions become more consistent without any change to the networks used at test time. Experiments on KITTI, Cityscapes and NYUv2 show the improvement, and KITTI Odometry results confirm the pose refinement. A reader would care because reliable scale alignment removes a hidden source of error that has limited many existing self-supervised pipelines.

Core claim

By using the depth estimated during training to reproject learnable visual features across consecutive frames and refining the pose estimates through reduction of feature alignment residuals, the scene scales produced by the separate depth and pose networks become aligned and the scale consistency of depth predictions improves across different sequences.

What carries the argument

A differentiable pose-refinement step that reprojects learnable visual features using the current depth estimate and minimizes the resulting frame-to-frame alignment residuals.
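
A minimal NumPy sketch of this idea, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a relative pose `(R, t)` from the target to the source frame, and dense feature maps of shape `(C, H, W)`. All function names (`reproject`, `bilinear_sample`, `feature_residual`, `best_translation_scale`) are hypothetical, and the brute-force search over translation scales only stands in for the paper's differentiable refinement, whose optimizer is not described on this page.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Map each target pixel into the source image using its depth and the relative pose."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # back-project to 3D
    pts = R @ pts + np.asarray(t, float).reshape(3, 1)     # move into the source camera
    proj = K @ pts                                         # project with the intrinsics
    return (proj[:2] / proj[2:3]).reshape(2, h, w)         # (u', v') for every pixel

def bilinear_sample(feat, uv):
    """Sample feat (C, H, W) at float coordinates uv (2, H, W); also return a validity mask."""
    _, h, w = feat.shape
    u, v = uv
    valid = (u >= 0) & (u <= w - 1) & (v >= 0) & (v <= h - 1)
    u = np.clip(u, 0, w - 1)
    v = np.clip(v, 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    out = (feat[:, v0, u0] * (1 - du) * (1 - dv) + feat[:, v0, u1] * du * (1 - dv)
           + feat[:, v1, u0] * (1 - du) * dv + feat[:, v1, u1] * du * dv)
    return out, valid

def feature_residual(feat_tgt, feat_src, depth, K, R, t):
    """Mean feature distance between the target map and the depth-warped source map."""
    warped, valid = bilinear_sample(feat_src, reproject(depth, K, R, t))
    diff = np.linalg.norm(warped - feat_tgt, axis=0)
    return float(diff[valid].mean())

def best_translation_scale(feat_tgt, feat_src, depth, K, R, t, scales):
    """Illustrative stand-in for refinement: rescale t and keep the lowest residual."""
    t = np.asarray(t, float)
    residuals = [feature_residual(feat_tgt, feat_src, depth, K, R, s * t) for s in scales]
    return float(scales[int(np.argmin(residuals))])
```

In this sketch, scaling the depth and the translation by the same factor leaves `reproject` unchanged, while scaling only one of them shifts the sampled features and raises the residual. That sensitivity is the kind of signal a refinement step could exploit.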

If this is right

Depth accuracy improves on both outdoor sequences (KITTI, Cityscapes) and indoor scenes (NYUv2).
Pose estimates become more accurate, as verified on KITTI Odometry benchmarks.
Scale consistency across sequences rises without any added computation at inference time.
The refinement module can be inserted into existing self-supervised training pipelines without architectural changes.
The method leaves inference speed and model size unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same residual-minimization idea could be tested on joint depth-and-flow estimation to see whether scale drift decreases over longer videos.
If feature quality is the dominant factor, replacing the learnable features with fixed pretrained descriptors would provide a direct test of how much the improvement relies on joint learning.
The approach may reduce the need for explicit scale-normalization post-processing when the depth maps are fed into downstream visual odometry systems.

Load-bearing premise

That the reduction of feature alignment residuals will correct scale mismatches between the depth and pose networks rather than allowing unrelated errors to trade off against each other.
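
For intuition about what a scale mismatch means here, consider the standard pinhole view-synthesis warp used in self-supervised depth training. This is a textbook relation under assumed notation (intrinsics $K$, rotation $R$, translation $t$, target depth $D(p)$, depth scale $s_d$, translation scale $s_t$), not an equation taken from the paper:

```latex
\begin{aligned}
p' &\sim K\bigl(R\,D(p)\,K^{-1}p + t\bigr), \\
K\bigl(R\,s_d D(p)\,K^{-1}p + s_t\,t\bigr)
  &= s_d\,K\Bigl(R\,D(p)\,K^{-1}p + \tfrac{s_t}{s_d}\,t\Bigr).
\end{aligned}
```

Because homogeneous coordinates ignore the overall factor $s_d$, the warp depends only on the ratio $s_t / s_d$: it is unchanged when both networks share a scale ($s_t = s_d$) and, for $t \neq 0$, generally shifts when they do not. Whether residual minimization drives this ratio to one in practice, rather than trading it off against other errors, is what the premise asserts.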

What would settle it

Apply the refinement on NYUv2 sequences, compute the variance of recovered scale factors across consecutive frames before and after, and observe that the variance does not decrease while depth accuracy stays the same or worsens.
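
One way such a check could be scripted, assuming ground-truth depth is available for each frame and taking the per-frame median ratio (common in self-supervised depth evaluation) as the recovered scale factor. The function names are illustrative, and the paper may define its scale statistic differently.

```python
import numpy as np

def median_scale(pred_depth, gt_depth):
    """Per-frame scale factor: median ground-truth depth over median predicted depth."""
    mask = gt_depth > 0  # ignore pixels without ground truth
    return float(np.median(gt_depth[mask]) / np.median(pred_depth[mask]))

def scale_variance(preds, gts):
    """Return the per-frame scale factors and their variance across the sequence."""
    scales = np.array([median_scale(p, g) for p, g in zip(preds, gts)])
    return scales, float(np.var(scales))
```

Running `scale_variance` on predictions from the baseline and from the refined model over the same frames would give the before and after variances that the test compares.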

Figures

Figures reproduced from arXiv: 2605.28477 by Changxuan Li, Federico Tombari, Nadine Berner, Nassir Navab, Stefano Gasperini.

**Figure 2.** Overview of the proposed SA4Depth, with our contributions to SSMDE highlighted in light blue. Compared to conventional SSMDE (left), our …

**Figure 3.** Qualitative comparison on KITTI Eigen [10], [35] among scale-aware methods. White arrows highlight coarse wrong estimates. Panels: input image, Monodepth2-50 (md2), FUMET on md2, ours on md2, MonoViT, ours on MonoViT.

**Figure 4.** Qualitative comparison on Cityscapes [11] among scale-aware methods. White arrows highlight coarse wrong estimates.

… the MASt3R [40] feature (15) and the PixLoc feature (16). Experiment (15) outperforms the baseline, demonstrating that pose refinement is effective with a suitable visual feature metric. Still, our method (6) is better, demonstrating the necessity of learning robust visual features jointly du…

**Figure 5.** Qualitative comparison on NYUv2 [12], errors marked in white.

**Figure 6.** Visual odometry results on the KITTI Odometry Seq. 09 (left) and …

**Figure 7.** More qualitative comparison on KITTI Eigen Split. The depth predictions are up-to-scale. Comparison also includes every second row, the improved …

**Figure 8.** More qualitative comparison on Cityscapes. The depth estimates are up-to-scale. White arrows highlight coarse wrong estimates.

**Figure 9.** More qualitative comparison on NYUv2. The depth estimates are up-to-scale. White marks highlight coarse wrong estimates.

**Figure 10.** Feature and confidence maps extracted by the feature net, which is trained on KITTI. The feature maps are processed by PCA. The confidence …

Original abstract

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research done to improve the depth network, efforts on the pose remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the estimated scene scales by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SA4Depth adds a feature reprojection module that aligns depth and pose scales in self-supervised depth estimation. The experiments are solid, but the paper stops short of proving the scale mechanism.

The punchline is that SA4Depth introduces a training-time pose refinement that reprojects features using the depth estimate to reduce alignment residuals, aiming to align the scales predicted by the depth and pose networks.

What the paper does well is integrate this refinement into existing self-supervised frameworks without affecting inference speed. The experiments cover standard benchmarks: improvements in depth estimation on KITTI, Cityscapes, and NYUv2, and better results on KITTI Odometry for pose. Releasing the code at the GitHub link allows direct verification and use by others.

The soft spots are around the central claim of scale alignment. The approach minimizes feature residuals after depth-based re-projection, which can improve consistency in practice. However, as noted in the stress-test, this does not necessarily enforce a consistent global scene scale; the optimization could reach a point where residuals are low through local pose tweaks that do not match the overall scale between networks. The abstract does not include the loss equations or ablations that would show whether the stationary point corresponds to matched scales. If the full paper has controlled experiments isolating the scale effect or metrics specifically for scale consistency across sequences, that would strengthen it. Otherwise, the gains might come from the additional loss acting as a regularizer rather than the claimed alignment.

This paper is for people working on self-supervised depth estimation from monocular video, particularly those using separate depth and pose networks. A reader interested in practical improvements to established pipelines would get value from the module and the reported results. It deserves a serious referee because the contribution is a clear, implementable addition with experiments on multiple datasets and public code, even if the theoretical link to scale consistency needs more scrutiny in review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SA4Depth for self-supervised monocular depth estimation. It identifies scale inconsistency between jointly trained depth and pose networks as a key limitation and proposes to address it by using the depth estimate to reproject learnable visual features across frames, then refining the pose network by minimizing the resulting feature alignment residuals. The method is presented as a differentiable module that integrates into existing pipelines, improves depth accuracy, and yields more consistent scale predictions across sequences, with supporting experiments on KITTI, Cityscapes, NYUv2, and KITTI Odometry.

Significance. If the proposed feature-residual refinement demonstrably enforces global scale alignment rather than permitting local compensation, the approach would be a practical and low-overhead contribution to self-supervised depth pipelines. The unchanged inference cost, public code release, and multi-dataset evaluation (including odometry) are strengths. The core idea targets a recognized but under-addressed inconsistency between the two networks.

major comments (2)

[Abstract / §3] Abstract and §3 (method description): the central claim that minimizing depth-reprojected feature residuals aligns the scene scales estimated by the depth and pose networks lacks an explicit derivation or stationary-point analysis showing that the equilibrium of the combined loss corresponds to matched scales rather than to compensating local adjustments in either network. The construction permits pose adjustments that absorb depth errors without enforcing global scale consistency.
[§4] §4 (experiments): the reported depth improvements and cross-sequence scale consistency gains are not accompanied by an ablation that isolates the scale-alignment effect from other changes in the training objective or from hyper-parameter tuning; without such controls it is difficult to attribute gains specifically to the claimed mechanism.

minor comments (2)

[§3] Notation for the feature reprojection and residual loss should be introduced with explicit equations rather than descriptive prose only.
[§3] The manuscript should clarify whether the learnable visual features are frozen or jointly optimized with the depth/pose networks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract / §3] Abstract and §3 (method description): the central claim that minimizing depth-reprojected feature residuals aligns the scene scales estimated by the depth and pose networks lacks an explicit derivation or stationary-point analysis showing that the equilibrium of the combined loss corresponds to matched scales rather than to compensating local adjustments in either network. The construction permits pose adjustments that absorb depth errors without enforcing global scale consistency.

Authors: We acknowledge that the manuscript provides no formal stationary-point analysis or derivation establishing that the equilibrium necessarily enforces global scale matching rather than local compensations. The design intends global alignment because the feature reprojection operates over the full image and multiple frames, but we agree this is not rigorously shown. In revision we will add a short explanatory paragraph in §3 clarifying the mechanism and why local pose adjustments alone cannot minimize the residuals without scale consistency; this constitutes a partial revision as the core method description remains unchanged.

revision: partial

Referee: [§4] §4 (experiments): the reported depth improvements and cross-sequence scale consistency gains are not accompanied by an ablation that isolates the scale-alignment effect from other changes in the training objective or from hyper-parameter tuning; without such controls it is difficult to attribute gains specifically to the claimed mechanism.

Authors: We agree that the experiments would benefit from ablations that isolate the contribution of the scale-alignment refinement. The current results compare complete pipelines but do not hold all other factors fixed while toggling only the proposed module. In the revised manuscript we will include controlled ablations that enable or disable the SA4Depth refinement while keeping the training objective, hyperparameters, and network architectures identical, thereby directly attributing improvements to the scale-alignment component.

revision: yes

Circularity Check

0 steps flagged

No circularity: scale alignment presented as optimization outcome, not tautological definition

The paper introduces an auxiliary differentiable loss that reprojects features using the depth network output to refine the pose network via residual minimization. The claim that this produces consistent scene-scale alignment between the two networks is framed as an empirical consequence of the joint optimization rather than a quantity defined in terms of itself or recovered by fitting a parameter to the target metric. No self-citation chain, ansatz smuggling, or renaming of a known result is used to establish the central result; the derivation remains self-contained against external benchmarks such as KITTI and NYUv2 depth metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that feature re-projection residuals are a reliable proxy for scale mismatch.

Reference graph

Works this paper leans on

42 extracted references · 2 canonical work pages · 2 internal anchors

[1] C. Zhang, Z. Cui, Y. Zhang, B. Zeng, M. Pollefeys, and S. Liu, “Holistic 3d scene understanding from a single image with implicit representation,” in CVPR, 2021, pp. 8833–8842.

[2] H. Xu, S. Peng, F. Wang, H. Blum, D. Barath, A. Geiger, and M. Pollefeys, “DepthSplat: Connecting gaussian splatting and depth,” in CVPR, 2025, pp. 16 453–16 463.

[3] J. Xiang, Y. Wang, L. An, H. Liu, Z. Wang, and J. Liu, “Visual attention-based self-supervised absolute depth estimation using geometric priors in autonomous driving,” IEEE RAL, vol. 7, no. 4, pp. 11 998–12 005, 2022.

[4] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, 2017.

[5] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in CVPR, 2021, pp. 4009–4018.

[6] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024, pp. 10 371–10 381.

[7] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3D packing for self-supervised monocular depth estimation,” in CVPR, 2020, pp. 2485–2494.

[8] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in CVPR, 2019, pp. 3828–3838.

[9] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, 2017, pp. 1851–1858.

[10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” IJRR, 2013.

[11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.

[12] P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in ECCV, 2012.

[13] C. Zhao, Y. Zhang, M. Poggi, F. Tosi, X. Guo, Z. Zhu, G. Huang, Y. Tang, and S. Mattoccia, “MonoViT: Self-supervised monocular depth estimation with a vision transformer,” in 3DV. IEEE, 2022, pp. 668–678.

[14] G. Kinoshita and K. Nishino, “Camera height doesn’t change: Unsupervised training for metric monocular road-scene depth estimation,” in ECCV. Springer, 2024, pp. 57–73.

[15] S. Gasperini, P. Koch, V. Dallabetta, N. Navab, B. Busam, and F. Tombari, “R4dyn: Exploring radar for self-supervised monocular depth estimation of dynamic scenes,” in 3DV. IEEE, 2021, pp. 751–760.

[16] S. Gasperini, N. Morbitzer, H. Jung, N. Navab, and F. Tombari, “Robust monocular depth estimation under challenging conditions,” in CVPR, 2023, pp. 8177–8186.

[17] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” in CVPR, 2018, pp. 2022–2030.

[18] W. Zhao, S. Liu, Y. Shu, and Y.-J. Liu, “Towards better generalization: Joint depth-pose learning without posenet,” in CVPR, 2020, pp. 9151–9161.

[19] A. Bangunharcana, A. Magd, and K.-S. Kim, “DualRefine: Self-supervised depth and pose estimation through iterative epipolar sampling and refinement toward equilibrium,” in CVPR, 2023, pp. 726–738.

[20] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in CVPR, 2012.

[21] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in CVPR, 2018, pp. 340–349.

[22] J. Liu, L. Kong, B. Li, Z. Wang, H. Gu, and J. Chen, “Mono-vifi: A unified learning framework for self-supervised single and multi-frame monocular depth estimation,” in ECCV. Springer, 2024, pp. 90–107.

[23] J. Yan, H. Zhao, P. Bu, and Y. Jin, “Channel-wise attention-based network for self-supervised monocular depth estimation,” in 3DV. IEEE, 2021, pp. 464–473.

[24] R. Li, P. Ji, Y. Xu, and B. Bhanu, “Monoindoor++: Towards better practice of self-supervised monocular depth estimation for indoor environments,” IEEE TCSVT, vol. 33, no. 2, pp. 830–846, 2022.

[25] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in CVPR, 2020, pp. 4938–4947.

[26] L. Von Stumberg, P. Wenzel, N. Yang, and D. Cremers, “LM-Reloc: Levenberg-Marquardt based direct visual relocalization,” in 3DV. IEEE, 2020, pp. 968–977.

[27] P.-E. Sarlin, A. Unagar, M. Larsson, H. Germain, C. Toft, V. Larsson, M. Pollefeys, V. Lepetit, L. Hammarstrand, F. Kahl et al., “Back to the feature: Learning robust camera localization from pixels to pose,” in CVPR, 2021, pp. 3247–3257.

[28] Y. Yu, S. Liu, R. Pautrat, M. Pollefeys, and V. Larsson, “Relative pose estimation through affine corrections of monocular depth priors,” in CVPR, 2025, pp. 16 706–16 716.

[29] Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020.

[30] K. Tateno, F. Tombari, I. Laina, and N. Navab, “Cnn-slam: Real-time dense monocular slam with learned depth prediction,” in CVPR, 2017, pp. 6243–6252.

[31] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in 3DV. IEEE, 2017, pp. 11–20.

[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.

[33] W. Zhang, H. Liu, B. Li, J. He, Z. Qi, Y. Wang, S. Zhao, X. Yu, W. Zeng, and X. Jin, “Hybrid-grained feature aggregation with coarse-to-fine language guidance for self-supervised monocular depth estimation,” in ICCV, 2025, pp. 6678–6692.

[34] D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[35] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in ICCV, 2015, pp. 2650–2658.

[36] J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman, “The temporal opportunist: Self-supervised multi-frame monocular depth,” in CVPR, 2021, pp. 1164–1174.

[37] J.-W. Bian, H. Zhan, N. Wang, T.-J. Chin, C. Shen, and I. Reid, “Auto-rectify network for unsupervised indoor depth estimation,” IEEE TPAMI, vol. 44, no. 12, pp. 9802–9813, 2021.

[38] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, “Visual odometry revisited: What should be learnt?” in ICRA. IEEE, 2020, pp. 4203–4210.

[39] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” NeurIPS, vol. 32, 2019.

[40] V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” in ECCV. Springer, 2024, pp. 71–91.

[41] W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y. Zhang, L. Quan, and Y. Shan, “Depthcrafter: Generating consistent long depth sequences for open-world videos,” in CVPR, 2025, pp. 2005–2015.

[42] S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang, “Video depth anything: Consistent depth estimation for super-long videos,” in CVPR, 2025, pp. 22 831–22 840.

Appendix A. Robustness against depth estimation noise

We present quantitative results of new experiments in Tab. VII to investigate the robustness of the joint training framework ag...

[1] [1]

Holistic 3d scene understanding from a single image with implicit representation,

C. Zhang, Z. Cui, Y . Zhang, B. Zeng, M. Pollefeys, and S. Liu, “Holistic 3d scene understanding from a single image with implicit representation,” inCVPR, 2021, pp. 8833–8842

2021

[2] [2]

DepthSplat: Connecting gaussian splatting and depth,

H. Xu, S. Peng, F. Wang, H. Blum, D. Barath, A. Geiger, and M. Polle- feys, “DepthSplat: Connecting gaussian splatting and depth,” inCVPR, 2025, pp. 16 453–16 463

2025

[3] [3]

Visual attention- based self-supervised absolute depth estimation using geometric priors in autonomous driving,

J. Xiang, Y . Wang, L. An, H. Liu, Z. Wang, and J. Liu, “Visual attention- based self-supervised absolute depth estimation using geometric priors in autonomous driving,”IEEE RAL, vol. 7, no. 4, pp. 11 998–12 005, 2022

2022

[4] [4]

Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,

R. Mur-Artal and J. D. Tard ´os, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,”IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, 2017

2017

[5] [5]

Adabins: Depth estimation using adaptive bins,

S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” inCVPR, 2021, pp. 4009–4018

2021

[6] [6]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” inCVPR, 2024, pp. 10 371–10 381

2024

[7] [7]

3D packing for self-supervised monocular depth estimation,

V . Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3D packing for self-supervised monocular depth estimation,” inCVPR, 2020, pp. 2485–2494

2020

[8] [8]

Digging into self-supervised monocular depth estimation,

C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” inCVPR, 2019, pp. 3828–3838

2019

[9] [9]

Unsupervised learning of depth and ego-motion from video,

T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” inCVPR, 2017, pp. 1851–1858

2017

[10] [10]

Vision meets robotics: The KITTI dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,”IJRR, 2013

2013

[11] [11]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inCVPR, 2016

2016

[12] [12]

Indoor segmen- tation and support inference from rgbd images,

P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor segmen- tation and support inference from rgbd images,” inECCV, 2012

2012

[13] [13]

MonoViT: Self-supervised monocular depth estimation with a vision transformer,

C. Zhao, Y . Zhang, M. Poggi, F. Tosi, X. Guo, Z. Zhu, G. Huang, Y . Tang, and S. Mattoccia, “MonoViT: Self-supervised monocular depth estimation with a vision transformer,” in3DV. IEEE, 2022, pp. 668– 678

2022

[14] [14]

Camera height doesn’t change: Unsu- pervised training for metric monocular road-scene depth estimation,

G. Kinoshita and K. Nishino, “Camera height doesn’t change: Unsu- pervised training for metric monocular road-scene depth estimation,” in ECCV. Springer, 2024, pp. 57–73

2024

[15] [15]

R4dyn: Exploring radar for self-supervised monocular depth estimation of dynamic scenes,

S. Gasperini, P. Koch, V . Dallabetta, N. Navab, B. Busam, and F. Tombari, “R4dyn: Exploring radar for self-supervised monocular depth estimation of dynamic scenes,” in3DV. IEEE, 2021, pp. 751–760

2021

[16] [16]

Robust monocular depth estimation under challenging conditions,

S. Gasperini, N. Morbitzer, H. Jung, N. Navab, and F. Tombari, “Robust monocular depth estimation under challenging conditions,” inCVPR, 2023, pp. 8177–8186

2023

[17] [17]

Learning depth from monocular videos using direct methods,

C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” inCVPR, 2018, pp. 2022– 2030

2018

[18] [18]

Towards better generalization: Joint depth-pose learning without posenet,

W. Zhao, S. Liu, Y . Shu, and Y .-J. Liu, “Towards better generalization: Joint depth-pose learning without posenet,” inCVPR, 2020, pp. 9151– 9161

2020

[19] [19]

DualRefine: Self- supervised depth and pose estimation through iterative epipolar sampling and refinement toward equilibrium,

A. Bangunharcana, A. Magd, and K.-S. Kim, “DualRefine: Self- supervised depth and pose estimation through iterative epipolar sampling and refinement toward equilibrium,” inCVPR, 2023, pp. 726–738

2023

[20] [20]

Are we ready for autonomous driving? the KITTI vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” inCVPR, 2012

2012

[21] [21]

Unsupervised learning of monocular depth estimation and visual odom- etry with deep feature reconstruction,

H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odom- etry with deep feature reconstruction,” inCVPR, 2018, pp. 340–349

2018

[22] [22]

Mono-vifi: A unified learning framework for self-supervised single and multi-frame monocular depth estimation,

J. Liu, L. Kong, B. Li, Z. Wang, H. Gu, and J. Chen, “Mono-vifi: A unified learning framework for self-supervised single and multi-frame monocular depth estimation,” in ECCV. Springer, 2024, pp. 90–107

2024

[23] [23]

Channel-wise attention-based network for self-supervised monocular depth estimation,

J. Yan, H. Zhao, P. Bu, and Y. Jin, “Channel-wise attention-based network for self-supervised monocular depth estimation,” in 3DV. IEEE, 2021, pp. 464–473

2021

[24] [24]

Monoindoor++: Towards better practice of self-supervised monocular depth estimation for indoor environments,

R. Li, P. Ji, Y. Xu, and B. Bhanu, “Monoindoor++: Towards better practice of self-supervised monocular depth estimation for indoor environments,” IEEE TCSVT, vol. 33, no. 2, pp. 830–846, 2022

2022

[25] [25]

Superglue: Learning feature matching with graph neural networks,

P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in CVPR, 2020, pp. 4938–4947

2020

[26] [26]

LM-Reloc: Levenberg-Marquardt based direct visual relocalization,

L. von Stumberg, P. Wenzel, N. Yang, and D. Cremers, “LM-Reloc: Levenberg-Marquardt based direct visual relocalization,” in 3DV. IEEE, 2020, pp. 968–977

2020

[27] [27]

Back to the feature: Learning robust camera localization from pixels to pose,

P.-E. Sarlin, A. Unagar, M. Larsson, H. Germain, C. Toft, V. Larsson, M. Pollefeys, V. Lepetit, L. Hammarstrand, F. Kahl et al., “Back to the feature: Learning robust camera localization from pixels to pose,” in CVPR, 2021, pp. 3247–3257

2021

[28] [28]

Relative pose estimation through affine corrections of monocular depth priors,

Y. Yu, S. Liu, R. Pautrat, M. Pollefeys, and V. Larsson, “Relative pose estimation through affine corrections of monocular depth priors,” in CVPR, 2025, pp. 16 706–16 716

2025

[29] [29]

Virtual KITTI 2

Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020

[30] [30]

Cnn-slam: Real-time dense monocular slam with learned depth prediction,

K. Tateno, F. Tombari, I. Laina, and N. Navab, “Cnn-slam: Real-time dense monocular slam with learned depth prediction,” in CVPR, 2017, pp. 6243–6252

2017

[31] [31]

Sparsity invariant cnns,

J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in 3DV. IEEE, 2017, pp. 11–20

2017

[32] [32]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778

2016

[33] [33]

Hybrid-grained feature aggregation with coarse-to-fine language guidance for self-supervised monocular depth estimation,

W. Zhang, H. Liu, B. Li, J. He, Z. Qi, Y. Wang, S. Zhao, X. Yu, W. Zeng, and X. Jin, “Hybrid-grained feature aggregation with coarse-to-fine language guidance for self-supervised monocular depth estimation,” in ICCV, 2025, pp. 6678–6692

2025

[34] [34]

Adam: A Method for Stochastic Optimization

D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

[35] [35]

Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,

D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in ICCV, 2015, pp. 2650–2658

2015

[36] [36]

The temporal opportunist: Self-supervised multi-frame monocular depth,

J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman, “The temporal opportunist: Self-supervised multi-frame monocular depth,” in CVPR, 2021, pp. 1164–1174

2021

[37] [37]

Auto-rectify network for unsupervised indoor depth estimation,

J.-W. Bian, H. Zhan, N. Wang, T.-J. Chin, C. Shen, and I. Reid, “Auto-rectify network for unsupervised indoor depth estimation,” IEEE TPAMI, vol. 44, no. 12, pp. 9802–9813, 2021

2021

[38] [38]

Visual odometry revisited: What should be learnt?

H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, “Visual odometry revisited: What should be learnt?” in ICRA. IEEE, 2020, pp. 4203–4210

2020

[39] [39]

Unsupervised scale-consistent depth and ego-motion learning from monocular video,

J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” NeurIPS, vol. 32, 2019

2019

[40] [40]

Grounding image matching in 3d with mast3r,

V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” in ECCV. Springer, 2024, pp. 71–91

2024

[41] [41]

Depthcrafter: Generating consistent long depth sequences for open-world videos,

W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y. Zhang, L. Quan, and Y. Shan, “Depthcrafter: Generating consistent long depth sequences for open-world videos,” in CVPR, 2025, pp. 2005–2015

2025

[42] [42]

Video depth anything: Consistent depth estimation for super-long videos,

S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang, “Video depth anything: Consistent depth estimation for super-long videos,” in CVPR, 2025, pp. 22 831–22 840.

2025

APPENDIX A. ROBUSTNESS AGAINST DEPTH ESTIMATION NOISE

We present quantitative results of new experiments in Tab. VII to investigate the robustness of the joint training framework ag...