One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models

Feng Xue; Haowei Li; Matthew Johnson-Roberson; Shusheng Yang; Tianyi Zhang; Xiang Li; Xiaohao Xu; Xiaonan Huang

arxiv: 2606.29600 · v1 · pith:KTLSUUNBnew · submitted 2026-06-28 · 💻 cs.CV · cs.AI

One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models

Xiaohao Xu , Feng Xue , Xiang Li , Haowei Li , Shusheng Yang , Tianyi Zhang , Matthew Johnson-Roberson , Xiaonan Huang This is my paper

Pith reviewed 2026-06-30 06:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords monocular depth estimationgeometric ambiguitydepth foundation modelstransparent scenesmulti-layer depthLaplacian Visual PromptingMD-3k benchmark

0 comments

The pith

Monocular depth foundation models resolve the same layered scene to different depths, as revealed by a new benchmark on transparent scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that monocular depth estimation collapses layered geometry into one scalar per pixel, but this reduction reflects annotation and training conventions rather than scene-intrinsic truth, especially visible in transparent scenes where a ray can hit multiple surfaces. They introduce the MD-3k benchmark with sparse two-layer ordinal annotations to quantify each model's depth-layer preference and multi-layer spatial relationship accuracy. Under standard RGB input, leading depth foundation models show diverse preferences on the same geometry. A training-free Laplacian Visual Prompting transformation can shift the preferred layer for some frozen models, with the best RGB/LVP pair reaching 75.5 percent ML-SRA. These findings indicate that models hold complementary geometric hypotheses that standard inference leaves unexpressed.

Core claim

Depth foundation models exhibit diverse layer preferences on the same layered geometry under standard RGB input. Laplacian Visual Prompting, a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair reaches 75.5 percent ML-SRA on MD-3k, suggesting that multiple valid 3D interpretations of a scene can be measured, preserved, and expressed through an ambiguity-aware lens.

What carries the argument

The MultiDepth-3k (MD-3k) benchmark, which uses sparse two-layer ordinal annotations on transparent scenes to measure depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA).

Load-bearing premise

The sparse two-layer ordinal annotations and ML-SRA metric in MD-3k provide a reliable, annotation-independent measure of geometric layer preference.

What would settle it

All leading depth models producing identical layer selections on MD-3k under both standard RGB and LVP inputs, or ML-SRA scores failing to correlate with actual multi-surface visibility in controlled ray-tracing tests.

Figures

Figures reproduced from arXiv: 2606.29600 by Feng Xue, Haowei Li, Matthew Johnson-Roberson, Shusheng Yang, Tianyi Zhang, Xiang Li, Xiaohao Xu, Xiaonan Huang.

**Figure 2.** Figure 2: Model-dependent depth-layer modulation. Standard RGB input reveals each model’s default depth layer preference (Cols. 2 and 7). Laplacian Visual Prompting can change the reported layer for receptive frozen models [18,25,37,38] and produce a candidate complementary depth hypothesis in ambiguous regions (Cols. 4 and 9). does not learn layer-free geometry; it learns a depth-layer preference, the layer it ten… view at source ↗

**Figure 3.** Figure 3: Each pair has labels for both layers. Masks and labels were cross-checked by multiple annotators in multiple review rounds before evaluation. Ratio of Ambiguous Area to Whole Image Area Count Probability of Ambiguous Region [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Statistics. Left: distribution of ambiguous area ratio per image. Right: 2D spatial heatmap of ambiguous regions over benchmark images in MD-3k, shown in normalized image coordinates. Benchmark statistics. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: The Laplacian Visual Prompting (LVP) method. (a) Standard model training couples RGB to single-layer depth. (b) At inference, the standard RGB input yields a single depth estimate, which is biased for ambiguous scenes. (c) LVP transforms the input via per-channel floatingpoint convolution with the Laplacian kernel, followed by min–max mapping back to the image-input value range, producing a candidate … view at source ↗

**Figure 6.** Figure 6: Model-dependent depth-layer preference. On MD-3k Reverse, each row links RGB (circle) and LVP (triangle). Fill indicates the preferred layer (red: foreground; blue: background), and crossing α = 0 indicates a layer change. The varied endpoints and shifts expose model-specific RGB priors and LVP responses. purpose DAv2 (DAv2-S/B/L) and indoor-tuned DAv2-I variants favor the first layer (transparent foregrou… view at source ↗

**Figure 7.** Figure 7: Feature visualization. PCA of DAv2-L encoder and decoder features. Under LVP input (Bottom), activations place greater emphasis on background highfrequency edges than under RGB input (Top). This qualitative, input-dependent feature highlighting is not evidence of discrete latent depth layers. candidate pair across all models. To contextualize these results, we define an Ideal Collapsed Baseline: a hypoth… view at source ↗

**Figure 8.** Figure 8: Scaling analysis. (a) On the Reverse subset of MD-3k, larger variants benefit most when RGB and LVP select different layers; when both inputs share the same layer bias, the candidate pair offers less complementarity. (b) On Same subset of MD-3k and DA-2K, we plot the RGB/LVP ML-SRA gap in percentage points. Smaller bars mean that LVP stays closer to the RGB baseline when the ordinal relation is consistent.… view at source ↗

**Figure 9.** Figure 9: Ablation of LVP design. Relative change in ML-SRA [%] compared to default LVP. Performance is robust to kernel variants (LVP-2: 8-neighbor), sign flip of Laplacian kernel (LVP-R), and grayscale input (LVP-G) [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Downstream illustrations. Selected RGB/LVP-conditioned depth hypotheses provide alternative ControlNet conditions and frame-wise depth streams. (a) Scenes with Curved Transparent Surfaces (b) Scenes with Semi-Transparent Surfaces [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Generalization and failure cases of LVP. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

read the original abstract

A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces MD-3k to show depth foundation models pick different layers on the same transparent scenes and that LVP can flip some choices, but the benchmark's annotation independence needs verification.

read the letter

The main thing here is that leading monocular depth models resolve the same layered scene differently under RGB input, and a training-free Laplacian Visual Prompting step can shift the layer some frozen models report. On their new MD-3k benchmark the best RGB/LVP combination reaches 75.5% ML-SRA.

What is new is the sparse two-layer ordinal benchmark itself plus the measurements of layer preference across models and the LVP intervention results. The abstract makes clear these are not in the referenced prior work. The paper does a solid job flagging that depth supervision in transparent or layered cases often reduces to annotation convention rather than scene geometry, and it gives a concrete way to measure that preference.

The soft spot is exactly the one in the stress-test note. The central claim requires MD-3k to act as an annotation-independent probe. If the two-layer labels were collected under conventions close to those in the models' pre-training data, the reported diversity and the LVP effect could simply track dataset overlap instead of exposing genuine ambiguity. The supplied text gives no details on annotation protocol, validation against training distributions, or statistical controls, so the support for the claim stays limited until those sections are checked.

This is for people working on monocular depth foundation models and their evaluation. A reader who wants empirical probes of model behavior on ambiguous geometry would get value from the results if the benchmark holds up. It deserves peer review to examine the annotation process and reproducibility; the idea is worth testing even if the current evidence is preliminary.

Referee Report

2 major / 2 minor

Summary. The paper introduces the MultiDepth-3k (MD-3k) benchmark of sparse two-layer ordinal annotations on transparent/layered scenes to measure depth foundation models' layer preferences. It reports diverse preferences across models under standard RGB input and shows that Laplacian Visual Prompting (LVP) can alter the preferred layer for frozen models, with the strongest RGB/LVP pair (DAv2-L) reaching 75.5% ML-SRA. The central claim is that these models express complementary geometric hypotheses on the same scene that standard inference leaves unexpressed.

Significance. If MD-3k validly isolates intrinsic layer preference, the work would usefully demonstrate that current monocular depth models encode different geometric conventions on ambiguous scenes and that training-free input transformations like LVP can surface alternative hypotheses. The empirical probe on multiple foundation models and the introduction of an ambiguity-aware metric are strengths; the training-free character of LVP is also noted positively.

major comments (2)

[§3] §3 (MD-3k construction): No details are provided on the annotation protocol, inter-annotator agreement, or explicit steps taken to ensure the two-layer ordinal labels are collected orthogonally to conventions in common depth-training corpora (e.g., foreground-glass vs. background prioritization). This is load-bearing for the claim that observed preferences and LVP effects reflect geometric ambiguity rather than dataset overlap.
[§4] §4 (experimental results): The reported 75.5% ML-SRA and claims of 'diverse' layer preferences across models lack reported statistical controls (confidence intervals, significance tests, or model-selection criteria), making it difficult to assess whether the diversity is robust or could be explained by benchmark construction choices.

minor comments (2)

[§3] The definition and computation of the ML-SRA metric on sparse annotations should be stated more explicitly, including how ties or missing layers are handled.
Figure captions and axis labels for layer-preference visualizations could be clarified to indicate whether percentages are normalized per model or across the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (MD-3k construction): No details are provided on the annotation protocol, inter-annotator agreement, or explicit steps taken to ensure the two-layer ordinal labels are collected orthogonally to conventions in common depth-training corpora (e.g., foreground-glass vs. background prioritization). This is load-bearing for the claim that observed preferences and LVP effects reflect geometric ambiguity rather than dataset overlap.

Authors: We agree that additional details on MD-3k construction are necessary to support the central claims. In the revised manuscript we will expand §3 with a full description of the annotation protocol (including annotator instructions and collection procedure), report inter-annotator agreement, and document the explicit steps taken to select scenes and instruct annotators so that labels remain orthogonal to common depth-training conventions such as foreground prioritization. revision: yes
Referee: [§4] §4 (experimental results): The reported 75.5% ML-SRA and claims of 'diverse' layer preferences across models lack reported statistical controls (confidence intervals, significance tests, or model-selection criteria), making it difficult to assess whether the diversity is robust or could be explained by benchmark construction choices.

Authors: We concur that statistical controls would improve interpretability. The revised §4 will include bootstrap confidence intervals for all ML-SRA scores, pairwise significance tests on layer-preference differences, and an explicit statement of model-selection criteria. These additions will allow readers to evaluate whether the observed diversity is robust. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly introduced benchmark

full rationale

The paper introduces MD-3k as a new sparse two-layer ordinal benchmark and reports empirical results (layer preferences, ML-SRA scores) for existing frozen models under RGB and LVP inputs. No equations, derivations, or fitted parameters are present that reduce any reported metric to quantities defined from the same evaluation data. No self-citation chains or uniqueness theorems are invoked to justify core claims. The work is self-contained as an observational probe; observed diversity in model outputs on MD-3k does not reduce to annotation conventions by construction within the paper's text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters, invented entities, or additional axioms beyond the domain premise that depth estimation collapses layered geometry are visible in the abstract.

axioms (1)

domain assumption Monocular depth estimation reduces layered geometry to one scalar depth per pixel.
This premise frames the entire evaluation and is stated in the opening of the abstract.

pith-pipeline@v0.9.1-grok · 5806 in / 1292 out tokens · 32769 ms · 2026-06-30T06:56:16.998618+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 4 canonical work pages · 2 internal anchors

[1]

arXiv preprint arXiv:2203.17274 (2022)

Bahng, H., Jahanian, A., Sankaranarayanan, S., Isola, P.: Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022)

work page arXiv 2022
[2]

In: CVPR

Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: CVPR. pp. 22861–22872 (2024)

2024
[3]

In: CVPR (2021)

Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: CVPR (2021)

2021
[4]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

1–a model zoo for robust monocular relative depth estimation

Birkl, R., Wofk, D., Müller, M.: Midas v3. 1–a model zoo for robust monocular relative depth estimation. arXiv:2307.14460 (2023)

work page arXiv 2023
[6]

In: ICLR (2025)

Bochkovskiy, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. In: ICLR (2025)

2025
[7]

NeurIPS33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS33, 1877–1901 (2020)

1901
[8]

In: ICASSP

Chen, A., Lorenz, P., Yao, Y., Chen, P.Y., Liu, S.: Visual prompting for adversarial robustness. In: ICASSP. pp. 1–5. IEEE (2023)

2023
[9]

In: ICRA

Chen, K., Wang, S., Xia, B., Li, D., Kan, Z., Li, B.: TODE-Trans: Transparent object depth estimation with transformer. In: ICRA. pp. 4880–4886 (2023)

2023
[10]

In: CIKM

Chen, L., Fan, Y., Ye, Y.: Adversarial reprogramming of pretrained neural networks for fraud detection. In: CIKM. pp. 2935–2939 (2021)

2021
[11]

In: NeurIPS (2016)

Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. In: NeurIPS (2016)

2016
[12]

In: NeurIPS (2014)

Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (2014)

2014
[13]

IEEE Robotics and Automation Letters7(3), 7383–7390 (2022)

Fang, H., Fang, H.S., Xu, S., Lu, C.: Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline. IEEE Robotics and Automation Letters7(3), 7383–7390 (2022)

2022
[14]

ECCV (2024)

Fu, X., Yin, W., Hu, M., Wang, K., Ma, Y., Tan, P., Shen, S., Lin, D., Long, X.: Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. ECCV (2024)

2024
[15]

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., Crawford, K.: Datasheets for datasets. Commun. ACM64(12), 86–92 (nov 2021)

2021
[16]

IJRR (2013)

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. IJRR (2013)

2013
[17]

In: AAAI (2025)

Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V.T., Ommer, B.: DepthFM: Fast monocular depth estimation with flow matching. In: AAAI (2025)

2025
[18]

In: CVPR (2024)

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repur- posing diffusion-based image generators for monocular depth estimation. In: CVPR (2024)

2024
[19]

CVPR (2023)

Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: Maple: Multi-modal prompt learning. CVPR (2023)

2023
[20]

TPAMI (2023) One Scene, Two Depths 17

Liang, Y., Deng, B., Liu, W., Qin, J., He, S.: Monocular depth estimation for glass walls with context: a new dataset and method. TPAMI (2023) One Scene, Two Depths 17

2023
[21]

In: CVPR

Mei, H., Yang, X., Wang, Y., Liu, Y., He, S., Zhang, Q., Wei, X., Lau, R.W.: Don’t hit me! glass detection in real-world scenes. In: CVPR. pp. 3687–3696 (2020)

2020
[22]

In: WACV

Neekhara, P., Hussain, S., Du, J., Dubnov, S., Koushanfar, F., McAuley, J.: Cross- modal adversarial reprogramming. In: WACV. pp. 2427–2435 (2022)

2022
[23]

In: CVPR

Piccinelli, L., Sakaridis, C., Segu, M., Yang, Y.H., Li, S., Abbeloos, W., Van Gool, L.: Unik3d: Universal camera monocular 3d estimation. In: CVPR. pp. 1028–1039 (2025)

2025
[24]

UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

Piccinelli, L., Sakaridis, C., Yang, Y.H., Segu, M., Li, S., Abbeloos, W., Van Gool, L.: Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

In: ICCV (2021)

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)

2021
[26]

TPAMI (2022)

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI (2022)

2022
[27]

In: ICCV (2021)

Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)

2021
[28]

In: CVPR (2022)

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

2022
[29]

In: ICRA

Sajjan, S., Moore, M., Pan, M., Nagaraja, G., Lee, J., Zeng, A., Song, S.: ClearGrasp: 3D shape estimation of transparent objects for manipulation. In: ICRA. pp. 3634– 3642 (2020)

2020
[30]

In: CVPR (2017)

Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: CVPR (2017)

2017
[31]

In: ECCV (2012)

Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)

2012
[32]

In: ICCV Workshops

Singha, M., Pal, H., Jha, A., Banerjee, B.: AD-CLIP: Adapting domains in prompt space using CLIP. In: ICCV Workshops. pp. 4355–4364 (2023)

2023
[33]

ICML (2020)

Tsai, Y.Y., Chen, P.Y., Ho, T.Y.: Transfer learning without knowing: Reprogram- ming black-box machine learning models with scarce data and limited resources. ICML (2020)

2020
[34]

In: AAAI (2024)

Wang, H., Liu, F., Jiao, L., Wang, J., Hao, Z., Li, S., Li, L., Chen, P., Liu, X.: Vilt-clip: Video and language tuning clip with multimodal prompt learning and scenario-guided optimization. In: AAAI (2024)

2024
[35]

In: CVPR

Wasim, S.T., Naseer, M., Khan, S., Khan, F.S., Shah, M.: Vita-clip: Video and text adaptive clip via multimodal prompting. In: CVPR. pp. 23034–23044 (2023)

2023
[36]

ICCV (2025)

Wen, H., Zuo, Y., Subramanian, V., Chen, P., Deng, J.: Seeing and seeing through the glass: Real and synthetic data for multi-layer depth estimation. ICCV (2025)

2025
[37]

In: CVPR (2024)

Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Un- leashing the power of large-scale unlabeled data. In: CVPR (2024)

2024
[38]

NeurIPS (2024)

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. NeurIPS (2024)

2024
[39]

In: ICCV (2023)

Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., Chen, X., Shen, C.: Metric3d: Towards zero-shot metric 3d prediction from a single image. In: ICCV (2023)

2023
[40]

In: ICCV (2023) 18 Xu et al

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV (2023) 18 Xu et al

2023
[41]

foreground

Zhu, L., Mousavian, A., Xiang, Y., Mazhar, H., van Eenbergen, J., Debnath, S., Fox, D.: RGB-D local implicit function for depth completion of transparent objects. In: CVPR. pp. 12725–12734 (2021) One Scene, Two Depths 19 Table A: ML-SRA onMD-3k with alternative high-frequency prompts.Each cell reportsOverall/Reverse/Same[%]. The effect generalizes beyond ...

2021

[1] [1]

arXiv preprint arXiv:2203.17274 (2022)

Bahng, H., Jahanian, A., Sankaranarayanan, S., Isola, P.: Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022)

work page arXiv 2022

[2] [2]

In: CVPR

Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: CVPR. pp. 22861–22872 (2024)

2024

[3] [3]

In: CVPR (2021)

Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: CVPR (2021)

2021

[4] [4]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

1–a model zoo for robust monocular relative depth estimation

Birkl, R., Wofk, D., Müller, M.: Midas v3. 1–a model zoo for robust monocular relative depth estimation. arXiv:2307.14460 (2023)

work page arXiv 2023

[6] [6]

In: ICLR (2025)

Bochkovskiy, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. In: ICLR (2025)

2025

[7] [7]

NeurIPS33, 1877–1901 (2020)

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS33, 1877–1901 (2020)

1901

[8] [8]

In: ICASSP

Chen, A., Lorenz, P., Yao, Y., Chen, P.Y., Liu, S.: Visual prompting for adversarial robustness. In: ICASSP. pp. 1–5. IEEE (2023)

2023

[9] [9]

In: ICRA

Chen, K., Wang, S., Xia, B., Li, D., Kan, Z., Li, B.: TODE-Trans: Transparent object depth estimation with transformer. In: ICRA. pp. 4880–4886 (2023)

2023

[10] [10]

In: CIKM

Chen, L., Fan, Y., Ye, Y.: Adversarial reprogramming of pretrained neural networks for fraud detection. In: CIKM. pp. 2935–2939 (2021)

2021

[11] [11]

In: NeurIPS (2016)

Chen, W., Fu, Z., Yang, D., Deng, J.: Single-image depth perception in the wild. In: NeurIPS (2016)

2016

[12] [12]

In: NeurIPS (2014)

Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (2014)

2014

[13] [13]

IEEE Robotics and Automation Letters7(3), 7383–7390 (2022)

Fang, H., Fang, H.S., Xu, S., Lu, C.: Transcg: A large-scale real-world dataset for transparent object depth completion and a grasping baseline. IEEE Robotics and Automation Letters7(3), 7383–7390 (2022)

2022

[14] [14]

ECCV (2024)

Fu, X., Yin, W., Hu, M., Wang, K., Ma, Y., Tan, P., Shen, S., Lin, D., Long, X.: Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. ECCV (2024)

2024

[15] [15]

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., Crawford, K.: Datasheets for datasets. Commun. ACM64(12), 86–92 (nov 2021)

2021

[16] [16]

IJRR (2013)

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. IJRR (2013)

2013

[17] [17]

In: AAAI (2025)

Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S.A., Hu, V.T., Ommer, B.: DepthFM: Fast monocular depth estimation with flow matching. In: AAAI (2025)

2025

[18] [18]

In: CVPR (2024)

Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repur- posing diffusion-based image generators for monocular depth estimation. In: CVPR (2024)

2024

[19] [19]

CVPR (2023)

Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: Maple: Multi-modal prompt learning. CVPR (2023)

2023

[20] [20]

TPAMI (2023) One Scene, Two Depths 17

Liang, Y., Deng, B., Liu, W., Qin, J., He, S.: Monocular depth estimation for glass walls with context: a new dataset and method. TPAMI (2023) One Scene, Two Depths 17

2023

[21] [21]

In: CVPR

Mei, H., Yang, X., Wang, Y., Liu, Y., He, S., Zhang, Q., Wei, X., Lau, R.W.: Don’t hit me! glass detection in real-world scenes. In: CVPR. pp. 3687–3696 (2020)

2020

[22] [22]

In: WACV

Neekhara, P., Hussain, S., Du, J., Dubnov, S., Koushanfar, F., McAuley, J.: Cross- modal adversarial reprogramming. In: WACV. pp. 2427–2435 (2022)

2022

[23] [23]

In: CVPR

Piccinelli, L., Sakaridis, C., Segu, M., Yang, Y.H., Li, S., Abbeloos, W., Van Gool, L.: Unik3d: Universal camera monocular 3d estimation. In: CVPR. pp. 1028–1039 (2025)

2025

[24] [24]

UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

Piccinelli, L., Sakaridis, C., Yang, Y.H., Segu, M., Li, S., Abbeloos, W., Van Gool, L.: Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

In: ICCV (2021)

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)

2021

[26] [26]

TPAMI (2022)

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI (2022)

2022

[27] [27]

In: ICCV (2021)

Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)

2021

[28] [28]

In: CVPR (2022)

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

2022

[29] [29]

In: ICRA

Sajjan, S., Moore, M., Pan, M., Nagaraja, G., Lee, J., Zeng, A., Song, S.: ClearGrasp: 3D shape estimation of transparent objects for manipulation. In: ICRA. pp. 3634– 3642 (2020)

2020

[30] [30]

In: CVPR (2017)

Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: CVPR (2017)

2017

[31] [31]

In: ECCV (2012)

Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)

2012

[32] [32]

In: ICCV Workshops

Singha, M., Pal, H., Jha, A., Banerjee, B.: AD-CLIP: Adapting domains in prompt space using CLIP. In: ICCV Workshops. pp. 4355–4364 (2023)

2023

[33] [33]

ICML (2020)

Tsai, Y.Y., Chen, P.Y., Ho, T.Y.: Transfer learning without knowing: Reprogram- ming black-box machine learning models with scarce data and limited resources. ICML (2020)

2020

[34] [34]

In: AAAI (2024)

Wang, H., Liu, F., Jiao, L., Wang, J., Hao, Z., Li, S., Li, L., Chen, P., Liu, X.: Vilt-clip: Video and language tuning clip with multimodal prompt learning and scenario-guided optimization. In: AAAI (2024)

2024

[35] [35]

In: CVPR

Wasim, S.T., Naseer, M., Khan, S., Khan, F.S., Shah, M.: Vita-clip: Video and text adaptive clip via multimodal prompting. In: CVPR. pp. 23034–23044 (2023)

2023

[36] [36]

ICCV (2025)

Wen, H., Zuo, Y., Subramanian, V., Chen, P., Deng, J.: Seeing and seeing through the glass: Real and synthetic data for multi-layer depth estimation. ICCV (2025)

2025

[37] [37]

In: CVPR (2024)

Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Un- leashing the power of large-scale unlabeled data. In: CVPR (2024)

2024

[38] [38]

NeurIPS (2024)

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. NeurIPS (2024)

2024

[39] [39]

In: ICCV (2023)

Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., Chen, X., Shen, C.: Metric3d: Towards zero-shot metric 3d prediction from a single image. In: ICCV (2023)

2023

[40] [40]

In: ICCV (2023) 18 Xu et al

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV (2023) 18 Xu et al

2023

[41] [41]

foreground

Zhu, L., Mousavian, A., Xiang, Y., Mazhar, H., van Eenbergen, J., Debnath, S., Fox, D.: RGB-D local implicit function for depth completion of transparent objects. In: CVPR. pp. 12725–12734 (2021) One Scene, Two Depths 19 Table A: ML-SRA onMD-3k with alternative high-frequency prompts.Each cell reportsOverall/Reverse/Same[%]. The effect generalizes beyond ...

2021