DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation

Bosheng Wang; Chi Huang; Hang Yin; Hao Li; Liang Wang; Wenhao Zhang; Xun Sun; Yuan Wang

arxiv: 2606.31488 · v1 · pith:YIS3KPBJnew · submitted 2026-06-30 · 💻 cs.CV

DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation

Chi Huang , Wenhao Zhang , Hang Yin , YuAn Wang , Hao Li , Bosheng Wang , Xun Sun , Liang Wang This is my paper

Pith reviewed 2026-07-01 06:10 UTC · model grok-4.3

classification 💻 cs.CV

keywords depth estimationautonomous drivingfoundation modelssparse promptsscale correctionnuScenesgeometric consistencymetric depth

0 comments

The pith

Foundation models already encode coherent relative depth for driving, so only per-pixel scale correction from sparse LiDAR is needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that depth foundation models already capture geometrically coherent relative depth maps, making additional surface structure learning unnecessary. The missing element for metric depth is solely a per-pixel scale factor that maps the relative geometry to absolute coordinates. DrivingDepth uses sparse LiDAR points as prompts to compute a residual scale correction map that locally calibrates the frozen model without regenerating depth. This preserves dense visual geometry by construction and avoids the artifacts seen in methods that override the prior. On nuScenes surround-view data the approach reports improved metric accuracy alongside better edge consistency than regeneration baselines.

Core claim

Foundation models already capture geometrically coherent relative depth; no additional surface structure learning is required-only a per-pixel scale factor mapping relative geometry to metric coordinates. DrivingDepth treats sparse LiDAR as geometric prompts that locally calibrate a frozen foundation prior through residual pixel-wise scale correction, preserving dense visual geometry by construction. On nuScenes with 4-frame surround-view input, DrivingDepth achieves an AbsRel of 11.19 and an EdgeCR of 5.741, outperforming MapAnything (11.99/1.914).

What carries the argument

residual pixel-wise scale correction that treats sparse LiDAR as local geometric prompts on a frozen foundation prior

If this is right

Metric depth is obtained without regenerating surface structure, preserving the foundation model's geometric coherence.
Sparse LiDAR serves only for local calibration rather than as primary depth source.
On nuScenes 4-frame surround-view input the method reaches AbsRel 11.19 and EdgeCR 5.741 while beating regeneration baselines.
Geometric consistency improves because the dense relative depth prior is left intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of geometry and scale may reduce the amount of dense supervision required for new driving domains.
Similar residual correction could be tested on other foundation-model outputs where metric scale is the primary gap.
Hardware requirements might shift toward fewer LiDAR beams if scale calibration proves robust.

Load-bearing premise

Foundation models already deliver pixel-aligned dense visual geometry whose only missing element is a per-pixel metric scale factor.

What would settle it

If the scale-corrected output exhibits structural artifacts on visually continuous surfaces or fails to align with dense ground-truth geometry where available, the claim that only scale is missing would be refuted.

Figures

Figures reproduced from arXiv: 2606.31488 by Bosheng Wang, Chi Huang, Hang Yin, Hao Li, Liang Wang, Wenhao Zhang, Xun Sun, Yuan Wang.

**Figure 1.** Figure 1: Comparison of DrivingDepth with DepthAnything3 [19] and MapAnything [15]. Left: DrivingDepth’s full-scene 3D reconstruction. Right: 2D depth maps and 3D point clouds. LiDAR points colored by Z-height; point clouds with projected image RGB. DrivingDepth maintains RGB-depth consistency while yielding correct scales. More demos are available at DrivingDepth-page. Abstract Dense depth estimation for autonomous… view at source ↗

**Figure 2.** Figure 2: Architecture of DrivingDepth. The frozen DA3 [19] maps surround-view images I and camera parameters π to a dense depth prior Dprior. The Geometry-Preserving Feature Adapter fuses image tokens with sparse-depth tokens and propagates them under constrained cross-view/frame attention, yielding depth-aware features {Fℓ}. The Sparse-Aware Pixel-Scale Head turns {Fℓ} and a LiDAR prompt into a per-pixel correctio… view at source ↗

**Figure 3.** Figure 3: 2D depth comparison on nuScenes [3] (rows 1–4) and DDAD [11] (rows 5–6). Columns: RGB, MOGE-2 [30], PriorDA [33], MapAnything [15], DA3 [19], Spix, and DrivingDepth output. Monocular methods use frame-wise alignment, while multi-view methods adopt clip-wise alignment. In Spix: gray≈1, blue<1, red>1. DrivingDepth output approximates DA3 depth scaled element-wise by Spix. Input-context analysis [PITH_FULL_I… view at source ↗

**Figure 4.** Figure 4: 3D reconstruction comparison on nuScenes [3]. Predicted point clouds (RGB-colored) and LiDAR (colored by height) rendered together. Depth outputs of all methods are aligned using ROE[29]. Left: RGB image (top) and DrivingDepth global point cloud (bottom). Right: zoomed crops comparing DA3 [19], MapAnything [15], and DrivingDepth. nuScenes scene, including colored point clouds and zoomed close-up views. The… view at source ↗

**Figure 5.** Figure 5: Surface-normal weight trade-off visualization. Left: RGB with sparse LiDAR overlay. Top row: predicted depth under different λnorm. Bottom row: per-pixel absolute relative error (blue=low, red=high). At λnorm=2, a visible horizontal line at the LiDAR coverage boundary indicates that the model breaks image-aligned structure in unsupervised regions when normal regularization is insufficient. λnorm AbsRel↓ δ… view at source ↗

read the original abstract

Dense depth estimation for autonomous driving faces a geometry-scale conflict: depth foundation models deliver pixel-aligned dense visual geometry without reliable metric scale, while projected LiDAR provides metric anchors that are sparse, noisy, and misaligned with image structures. Existing sparse-prompted methods incorporate LiDAR by regenerating depth from scratch, overriding the foundation model's coherent geometry and producing structural artifacts on visually continuous surfaces. Our key insight is that foundation models already capture geometrically coherent relative depth; no additional surface structure learning is required-only a per-pixel scale factor mapping relative geometry to metric coordinates. Based on this, we propose DrivingDepth, which treats sparse LiDAR as geometric prompts that locally calibrate a frozen foundation prior through residual pixel-wise scale correction, preserving dense visual geometry by construction. On nuScenes with 4-frame surround-view input, DrivingDepth achieves an AbsRel of 11.19 and an EdgeCR of 5.741, outperforming MapAnything (11.99/1.914) by simultaneously delivering SOTA metric accuracy and geometric consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is freezing a depth foundation model and learning only a per-pixel residual scale map from sparse LiDAR prompts, which is a clean way to fix metric scale without rebuilding geometry.

read the letter

The central idea here is that foundation models already give you coherent relative depth, so the only missing piece is a per-pixel scale factor that sparse LiDAR can supply locally. DrivingDepth learns a multiplicative correction map on top of the frozen prior instead of regenerating the whole depth field. On nuScenes with 4-frame surround views it reports AbsRel of 11.19 and EdgeCR of 5.741, beating MapAnything on both metric accuracy and edge consistency.

What stands out is the decision to treat LiDAR strictly as calibration prompts rather than as input that overrides the visual geometry. That choice directly targets the artifacts the authors say appear when other sparse-prompted methods start from scratch. The evaluation metrics line up with the claim: one for scale error, one for structural fidelity.

The soft spots are the missing pieces that would let a reader judge whether the correction actually works as advertised. The abstract gives no architecture for the scale network, no ablation on LiDAR density or alignment handling, and no error breakdown by scene type or distance. Without those, the modest numerical edge could come from implementation details that are not yet visible. The assumption that the foundation model already supplies pixel-perfect relative geometry also needs checking against real driving data where lighting, reflections, or motion blur can degrade the prior.

This is aimed at researchers who already use depth foundation models in driving pipelines and want a lightweight way to add metric scale. A reader working on sensor fusion or efficient adaptation would find the framing useful even if the experiments need expansion.

It deserves a serious referee to look at the method section and full results, because the argument is internally consistent and the reported metrics address the right quantities.

Referee Report

0 major / 1 minor

Summary. The paper claims that foundation models already provide pixel-aligned dense relative depth for driving scenes, so metric depth estimation requires only a per-pixel multiplicative scale correction (prompted by sparse LiDAR) rather than regenerating depth structure. DrivingDepth implements this by treating LiDAR points as geometric prompts to locally calibrate a frozen foundation prior via residual scale correction, thereby preserving visual geometry by construction. On nuScenes with 4-frame surround-view input, it reports AbsRel of 11.19 and EdgeCR of 5.741, outperforming MapAnything (11.99/1.914).

Significance. If the central claim holds, the work shows that sparse-prompted scale correction suffices to convert foundation-model relative depth into metric depth without structural artifacts, which could simplify pipelines that currently override foundation priors. The dual reporting of metric error and geometric consistency metrics directly tests the preservation claim.

minor comments (1)

The abstract states performance numbers but does not indicate whether the scale-correction network is trained end-to-end or with frozen components beyond the foundation model; a one-sentence clarification in §3 or the abstract would help readers assess implementation scope.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The provided summary correctly captures the core claim that foundation models already deliver pixel-aligned relative depth, so that metric conversion reduces to per-pixel scale correction prompted by sparse LiDAR.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper states its key insight directly as an assumption (foundation models supply coherent relative depth; only per-pixel scale correction is required) and builds the method as a direct implementation of that assumption via residual scale correction prompted by sparse LiDAR. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the claimed result to its inputs by construction. The evaluation metrics (AbsRel, EdgeCR) address the stated goals without evidence of statistical forcing or definitional equivalence. This is the normal case of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5724 in / 1146 out tokens · 39134 ms · 2026-07-01T06:10:48.489629+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 9 canonical work pages · 4 internal anchors

[1]

AdaBins: Depth estimation using adaptive bins

Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4009–4018, 2021. 2

2021
[2]

Depth Pro: Sharp monocular metric depth in less than a second

Alexey Bochkovskiy, Ama ¨el Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. InInternational Conference on Learning Rep- resentations, pages 75602–75637, 2025. 1, 2, 6

2025
[3]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuScenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631, 2020. 2, 5, 6, 7, 8, 9

2020
[4]

Yuanzhouhan Cao, Zifeng Wu, and Chunhua Shen. Esti- mating depth from monocular images as classification using deep fully convolutional residual networks.IEEE Transac- tions on Circuits and Systems for Video Technology, 28(11): 3174–3182, 2017. 2

2017
[5]

TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

DGGT: Feedforward 4D reconstruction of dynamic driving scenes using unposed images

Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hongyang Li, Ya-Qin Zhang, Hangjun Ye, and Hao Zhao. DGGT: Feedforward 4D reconstruction of dynamic driving scenes using unposed images. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages ...

2026
[7]

Depth map prediction from a single image using a multi-scale deep net- work.Advances in Neural Information Processing Systems, 27, 2014

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep net- work.Advances in Neural Information Processing Systems, 27, 2014. 2

2014
[8]

Deep ordinal regression net- work for monocular depth estimation

Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Bat- manghelich, and Dacheng Tao. Deep ordinal regression net- work for monocular depth estimation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 2002–2011, 2018. 2

2002
[9]

Unsupervised monocular depth estimation with left- right consistency

Cl ´ement Godard, Oisin Mac Aodha, and Gabriel J Bros- tow. Unsupervised monocular depth estimation with left- right consistency. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 270–279, 2017. 2

2017
[10]

Digging into self-supervised monocular depth estimation

Cl ´ement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 3828– 3838, 2019. 2

2019
[11]

3D packing for self-supervised monocular depth estimation

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raven- tos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2485–2494, 2020. 2, 5, 6, 7, 9

2020
[12]

Dist-4D: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation

Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4D: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27231–27241, 2025. 2

2025
[13]

DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026

Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, et al. DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026. 3

work page arXiv 2026
[14]

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024. 2

2024
[15]

MapAnything: Universal feed- forward metric 3D reconstruction

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed- forward metric 3D reconstructio...

2026
[16]

From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019

Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019. 2

work page arXiv 1907
[17]

Ground- ing image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3D with MASt3R. InProceedings of the European Conference on Computer Vision (ECCV),
[18]

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Ji- aming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17070–17080, 2025. 2, 4, 6

2025
[19]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Any- thing 3: Recovering the visual space from any views. InThe Fourteenth International Conference on Learning Represen- tations, 2026. 1, 2, 3, 4, 5, 6, 7, 8

2026
[20]

VGD: Visual geometry gaussian splat- ting for feed-forward surround-view driving reconstruction

Junhong Lin, Kangli Wang, Shunzhou Wang, Songlin Fan, Ge Li, and Wei Gao. VGD: Visual geometry gaussian splat- ting for feed-forward surround-view driving reconstruction. arXiv preprint arXiv:2510.19578, 2025. 3

work page arXiv 2025
[21]

Learning depth from single monocular images using deep convolutional neural fields.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 38(10):2024–2039,

Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 38(10):2024–2039,

2024
[22]

Metri- cAnything: Scaling metric depth pretraining with noisy het- erogeneous sources.arXiv preprint arXiv:2601.22054, 2026

Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, and Wei Chen. Metri- cAnything: Scaling metric depth pretraining with noisy het- erogeneous sources.arXiv preprint arXiv:2601.22054, 2026. 5

work page arXiv 2026
[23]

UniDepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

2024
[24]

Vi- sion transformers for dense prediction

Rene Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021. 2

2021
[25]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Ji- ayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Driv- ingForward: Feed-forward 3D gaussian splatting for driv- ing scene reconstruction from flexible surround-view input

Qijian Tian, Xin Tan, Yuan Xie, and Lizhuang Ma. Driv- ingForward: Feed-forward 3D gaussian splatting for driv- ing scene reconstruction from flexible surround-view input. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 7374–7382, 2025. 3

2025
[27]

VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025. 2

2025
[28]

Efros, and Angjoo Kanazawa

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10510–10522, 2025. 2

2025
[29]

MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5261–5271, 2025. 3, 5, 6, 7, 8

2025
[30]

MoGe-2: Accurate monocular geometry with metric scale and sharp details.Advances in Neural Information Pro- cessing Systems, 38:35928–35959, 2026

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details.Advances in Neural Information Pro- cessing Systems, 38:35928–35959, 2026. 1, 2, 6, 7

2026
[31]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024. 2, 5

2024
[32]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chun- hua Shen, and Tong He.π 3: Permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Depth Anything with any prior

Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, and Zhou Zhao. Depth Anything with any prior. InThe Fourteenth International Conference on Learning Representations, 2026. 2, 6, 7

2026
[34]

Surround- Depth: Entangling surrounding views for self-supervised multi-camera depth estimation

Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yong- ming Rao, Guan Huang, Jiwen Lu, and Jie Zhou. Surround- Depth: Entangling surrounding views for self-supervised multi-camera depth estimation. InProceedings of The 6th Conference on Robot Learning, pages 539–549. PMLR,
[35]

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, et al. SCAL3R: Scalable test-time training for large- scale 3D reconstruction.arXiv preprint arXiv:2604.08542,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Depth Anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 1, 2

2024
[37]

Depth Any- thing V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Any- thing V2. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2

2024
[38]

InfiniDepth: Arbitrary-resolution and fine- grained depth estimation with neural implicit fields

Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, and Sida Peng. InfiniDepth: Arbitrary-resolution and fine- grained depth estimation with neural implicit fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26920–26930, 2026. 2

2026
[39]

StreetForward: Perceiv- ing dynamic street with feedforward causal attention.arXiv preprint arXiv:2603.19552, 2026

Zhongrui Yu, Zhao Wang, Yijia Xie, Yida Wang, Xueyang Zhang, Yifei Zhan, and Kun Zhan. StreetForward: Perceiv- ing dynamic street with feedforward causal attention.arXiv preprint arXiv:2603.19552, 2026. 3

work page arXiv 2026
[40]

NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation

Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3916–3925, 2022. 2

2022
[41]

CompletionFormer: Depth completion with convolutions and vision transform- ers

Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, and Stefano Mattoccia. CompletionFormer: Depth completion with convolutions and vision transform- ers. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 18527– 18536, 2023. 2

2023
[42]

M2Depth: Self-supervised two-frame multi- camera metric depth estimation

Yingshuang Zou, Yikang Ding, Xi Qiu, Haoqian Wang, and Haotian Zhang. M2Depth: Self-supervised two-frame multi- camera metric depth estimation. InComputer Vision – ECCV 2024, pages 269–285, Cham, 2025. Springer Nature Switzer- land. 2

2024
[43]

DVGT: Driving visual geometry transformer

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, and Ji- wen Lu. DVGT: Driving visual geometry transformer. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 14658–14668,

[1] [1]

AdaBins: Depth estimation using adaptive bins

Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4009–4018, 2021. 2

2021

[2] [2]

Depth Pro: Sharp monocular metric depth in less than a second

Alexey Bochkovskiy, Ama ¨el Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. InInternational Conference on Learning Rep- resentations, pages 75602–75637, 2025. 1, 2, 6

2025

[3] [3]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuScenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631, 2020. 2, 5, 6, 7, 8, 9

2020

[4] [4]

Yuanzhouhan Cao, Zifeng Wu, and Chunhua Shen. Esti- mating depth from monocular images as classification using deep fully convolutional residual networks.IEEE Transac- tions on Circuits and Systems for Video Technology, 28(11): 3174–3182, 2017. 2

2017

[5] [5]

TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

DGGT: Feedforward 4D reconstruction of dynamic driving scenes using unposed images

Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hongyang Li, Ya-Qin Zhang, Hangjun Ye, and Hao Zhao. DGGT: Feedforward 4D reconstruction of dynamic driving scenes using unposed images. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages ...

2026

[7] [7]

Depth map prediction from a single image using a multi-scale deep net- work.Advances in Neural Information Processing Systems, 27, 2014

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep net- work.Advances in Neural Information Processing Systems, 27, 2014. 2

2014

[8] [8]

Deep ordinal regression net- work for monocular depth estimation

Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Bat- manghelich, and Dacheng Tao. Deep ordinal regression net- work for monocular depth estimation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 2002–2011, 2018. 2

2002

[9] [9]

Unsupervised monocular depth estimation with left- right consistency

Cl ´ement Godard, Oisin Mac Aodha, and Gabriel J Bros- tow. Unsupervised monocular depth estimation with left- right consistency. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 270–279, 2017. 2

2017

[10] [10]

Digging into self-supervised monocular depth estimation

Cl ´ement Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 3828– 3838, 2019. 2

2019

[11] [11]

3D packing for self-supervised monocular depth estimation

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raven- tos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2485–2494, 2020. 2, 5, 6, 7, 9

2020

[12] [12]

Dist-4D: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation

Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4D: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27231–27241, 2025. 2

2025

[13] [13]

DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026

Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, et al. DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving.arXiv preprint arXiv:2603.08254, 2026. 3

work page arXiv 2026

[14] [14]

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10579–10596, 2024. 2

2024

[15] [15]

MapAnything: Universal feed- forward metric 3D reconstruction

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed- forward metric 3D reconstructio...

2026

[16] [16]

From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019

Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019. 2

work page arXiv 1907

[17] [17]

Ground- ing image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3D with MASt3R. InProceedings of the European Conference on Computer Vision (ECCV),

[18] [18]

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Ji- aming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17070–17080, 2025. 2, 4, 6

2025

[19] [19]

Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Any- thing 3: Recovering the visual space from any views. InThe Fourteenth International Conference on Learning Represen- tations, 2026. 1, 2, 3, 4, 5, 6, 7, 8

2026

[20] [20]

VGD: Visual geometry gaussian splat- ting for feed-forward surround-view driving reconstruction

Junhong Lin, Kangli Wang, Shunzhou Wang, Songlin Fan, Ge Li, and Wei Gao. VGD: Visual geometry gaussian splat- ting for feed-forward surround-view driving reconstruction. arXiv preprint arXiv:2510.19578, 2025. 3

work page arXiv 2025

[21] [21]

Learning depth from single monocular images using deep convolutional neural fields.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 38(10):2024–2039,

Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 38(10):2024–2039,

2024

[22] [22]

Metri- cAnything: Scaling metric depth pretraining with noisy het- erogeneous sources.arXiv preprint arXiv:2601.22054, 2026

Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, and Wei Chen. Metri- cAnything: Scaling metric depth pretraining with noisy het- erogeneous sources.arXiv preprint arXiv:2601.22054, 2026. 5

work page arXiv 2026

[23] [23]

UniDepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

2024

[24] [24]

Vi- sion transformers for dense prediction

Rene Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021. 2

2021

[25] [25]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Ji- ayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Driv- ingForward: Feed-forward 3D gaussian splatting for driv- ing scene reconstruction from flexible surround-view input

Qijian Tian, Xin Tan, Yuan Xie, and Lizhuang Ma. Driv- ingForward: Feed-forward 3D gaussian splatting for driv- ing scene reconstruction from flexible surround-view input. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 7374–7382, 2025. 3

2025

[27] [27]

VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025. 2

2025

[28] [28]

Efros, and Angjoo Kanazawa

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10510–10522, 2025. 2

2025

[29] [29]

MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5261–5271, 2025. 3, 5, 6, 7, 8

2025

[30] [30]

MoGe-2: Accurate monocular geometry with metric scale and sharp details.Advances in Neural Information Pro- cessing Systems, 38:35928–35959, 2026

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details.Advances in Neural Information Pro- cessing Systems, 38:35928–35959, 2026. 1, 2, 6, 7

2026

[31] [31]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024. 2, 5

2024

[32] [32]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chun- hua Shen, and Tong He.π 3: Permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Depth Anything with any prior

Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, and Zhou Zhao. Depth Anything with any prior. InThe Fourteenth International Conference on Learning Representations, 2026. 2, 6, 7

2026

[34] [34]

Surround- Depth: Entangling surrounding views for self-supervised multi-camera depth estimation

Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yong- ming Rao, Guan Huang, Jiwen Lu, and Jie Zhou. Surround- Depth: Entangling surrounding views for self-supervised multi-camera depth estimation. InProceedings of The 6th Conference on Robot Learning, pages 539–549. PMLR,

[35] [35]

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, et al. SCAL3R: Scalable test-time training for large- scale 3D reconstruction.arXiv preprint arXiv:2604.08542,

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

Depth Anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 1, 2

2024

[37] [37]

Depth Any- thing V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Any- thing V2. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2

2024

[38] [38]

InfiniDepth: Arbitrary-resolution and fine- grained depth estimation with neural implicit fields

Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, and Sida Peng. InfiniDepth: Arbitrary-resolution and fine- grained depth estimation with neural implicit fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26920–26930, 2026. 2

2026

[39] [39]

StreetForward: Perceiv- ing dynamic street with feedforward causal attention.arXiv preprint arXiv:2603.19552, 2026

Zhongrui Yu, Zhao Wang, Yijia Xie, Yida Wang, Xueyang Zhang, Yifei Zhan, and Kun Zhan. StreetForward: Perceiv- ing dynamic street with feedforward causal attention.arXiv preprint arXiv:2603.19552, 2026. 3

work page arXiv 2026

[40] [40]

NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation

Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3916–3925, 2022. 2

2022

[41] [41]

CompletionFormer: Depth completion with convolutions and vision transform- ers

Youmin Zhang, Xianda Guo, Matteo Poggi, Zheng Zhu, Guan Huang, and Stefano Mattoccia. CompletionFormer: Depth completion with convolutions and vision transform- ers. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 18527– 18536, 2023. 2

2023

[42] [42]

M2Depth: Self-supervised two-frame multi- camera metric depth estimation

Yingshuang Zou, Yikang Ding, Xi Qiu, Haoqian Wang, and Haotian Zhang. M2Depth: Self-supervised two-frame multi- camera metric depth estimation. InComputer Vision – ECCV 2024, pages 269–285, Cham, 2025. Springer Nature Switzer- land. 2

2024

[43] [43]

DVGT: Driving visual geometry transformer

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, and Ji- wen Lu. DVGT: Driving visual geometry transformer. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 14658–14668,