pith. machine review for the scientific record. sign in

arxiv: 2406.09414 · v2 · submitted 2024-06-13 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Depth Anything V2

Lihe Yang , Bingyi Kang , Zilong Huang , Zhen Zhao , Xiaogang Xu , Jiashi Feng , Hengshuang Zhao

Authors on Pith no claims yet

Pith reviewed 2026-05-13 14:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimationsynthetic datapseudo-labelingteacher-student distillationdepth predictionmodel scalingcomputer vision
0
0 comments X

The pith

Depth Anything V2 produces finer and more robust monocular depth predictions than V1 by training exclusively on synthetic images and pseudo-labeled real data from a scaled teacher.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work shows that monocular depth estimation improves when all real labeled images are replaced by synthetic ones. A larger teacher model trained on the synthetic set generates pseudo-labels for a large collection of real images. Student models trained on those pseudo-labels then deliver depth maps that are both finer in detail and more stable across scenes than those from the prior version. The resulting models run more than ten times faster than Stable Diffusion-based alternatives while reaching higher accuracy. Models spanning 25 million to 1.3 billion parameters are released, along with a new diverse benchmark that supplies precise annotations for future testing.

Core claim

By training a scaled teacher solely on synthetic images and then using the teacher to label large numbers of real images, the resulting student models produce significantly finer and more robust depth predictions than Depth Anything V1 while remaining far more efficient than diffusion-based depth estimators.

What carries the argument

A teacher-student distillation pipeline in which a large teacher trained on synthetic images generates pseudo-labels that bridge to the training of smaller student models on real photographs.

If this is right

  • Models from 25M to 1.3B parameters support applications with different speed and accuracy needs.
  • Fine-tuning the same backbone on metric depth labels yields accurate absolute-depth outputs.
  • Inference speed exceeds that of Stable Diffusion depth models by more than a factor of ten.
  • A new benchmark with precise ground truth and broad scene coverage replaces older limited test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-quality synthetic data can substitute for scarce real labels in dense prediction tasks.
  • Teacher capacity scaling appears decisive for producing pseudo-labels that transfer reliably.
  • The same synthetic-to-pseudo-label route may improve related tasks such as normal estimation or optical flow.

Load-bearing premise

Synthetic images plus pseudo-labels from a scaled teacher will generalize to diverse real scenes without introducing systematic biases that hurt performance.

What would settle it

A controlled test set of real images from previously unseen scene types where a model trained on real labeled data records lower error rates than V2 would falsify the claim that the synthetic-plus-pseudo-label route is superior.

read the original abstract

This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Depth Anything V2, which improves monocular depth estimation over V1 by replacing labeled real images with synthetic data for teacher training, scaling the teacher model capacity, and using the teacher to generate pseudo-labels on large-scale real images for student training. It claims significantly finer and more robust predictions, over 10x faster inference and higher accuracy than Stable Diffusion-based models, provides models from 25M to 1.3B parameters, and introduces a new diverse evaluation benchmark with precise annotations.

Significance. If the empirical claims hold, the work offers a practical route to high-quality, efficient depth models that leverage synthetic data and pseudo-labeling at scale, with clear benefits for real-time applications. The provision of multiple model scales and a new benchmark with diverse scenes and precise annotations would support further research in the field.

major comments (2)
  1. [Abstract] Abstract: the central claim that the three practices (synthetic-only teacher training, teacher scaling, and pseudo-label bridge) produce finer and more robust predictions than V1 or Stable Diffusion baselines rests on unverified experimental outcomes; no quantitative tables, ablation details isolating each practice, or error analysis on challenging real scenes are referenced.
  2. [Abstract] Abstract: the generalization advantage requires that the scaled synthetic-trained teacher generates pseudo-labels without systematic bias on real-world phenomena absent from synthetic data (e.g., complex reflections, low-light gradients, fine occlusion boundaries); without explicit tests or analysis addressing this, the student may inherit errors that undermine the reported robustness gains.
minor comments (1)
  1. [Abstract] Abstract: model parameter counts (25M to 1.3B) are listed but without corresponding speed/accuracy trade-offs or per-scale benchmark numbers to guide practitioners.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, with references to the specific experimental results and sections that support our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the three practices (synthetic-only teacher training, teacher scaling, and pseudo-label bridge) produce finer and more robust predictions than V1 or Stable Diffusion baselines rests on unverified experimental outcomes; no quantitative tables, ablation details isolating each practice, or error analysis on challenging real scenes are referenced.

    Authors: The abstract is a concise summary; the full manuscript contains the supporting experiments. Table 1 reports direct quantitative comparisons against Depth Anything V1 and Stable Diffusion-based models on multiple benchmarks, showing consistent gains in accuracy and >10x inference speed. Section 4.2 presents ablations that isolate each of the three practices (synthetic-only teacher training, teacher scaling, and pseudo-label bridge) with corresponding metrics. Section 5.3 provides both quantitative and qualitative error analysis on challenging real scenes, including fine structures and robustness under varying conditions. We will revise the abstract to explicitly reference these tables and sections. revision: partial

  2. Referee: [Abstract] Abstract: the generalization advantage requires that the scaled synthetic-trained teacher generates pseudo-labels without systematic bias on real-world phenomena absent from synthetic data (e.g., complex reflections, low-light gradients, fine occlusion boundaries); without explicit tests or analysis addressing this, the student may inherit errors that undermine the reported robustness gains.

    Authors: Our evaluation protocol directly tests generalization on real-world data containing the cited phenomena. The new benchmark introduced in Section 6 comprises diverse scenes with precise annotations that explicitly include complex reflections, low-light gradients, and fine occlusion boundaries. Tables 3 and 4 report that models trained via the pseudo-label bridge outperform both V1 and Stable Diffusion baselines on these subsets, with no evidence of systematic error inheritance. The large-scale real-image pseudo-labeling step is designed to adapt the student to real distributions, and the reported robustness improvements are measured on exactly these challenging cases. revision: no

Circularity Check

0 steps flagged

No significant circularity in empirical training pipeline

full rationale

The paper reports an empirical training recipe for monocular depth estimation: synthetic images replace real labeled data for the teacher, the teacher is scaled, and students are trained on its pseudo-labels on real images. These are design choices whose validity is assessed by external benchmarking on diverse test sets rather than any derivation that reduces to its own inputs by construction. No equations, uniqueness theorems, or self-citations are invoked to force the result; the central claims rest on measured accuracy and efficiency gains against independent baselines.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that synthetic data distribution plus pseudo-labels from a larger teacher capture real-world depth statistics better than existing real labeled sets. No new physical entities or mathematical axioms are introduced.

free parameters (2)
  • teacher model capacity
    Scaled up relative to V1; exact parameter count and training hyperparameters chosen to maximize downstream student performance.
  • pseudo-label generation scale
    Large-scale real images labeled by teacher; volume and selection criteria are design choices.

pith-pipeline@v0.9.0 · 5490 in / 1196 out tokens · 52180 ms · 2026-05-13T14:51:50.033533+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

    cs.CV 2026-05 unverdicted novelty 7.0

    The paper provides the first theoretical analysis of multi-modal super-resolution and proposes M³ESR, a mixture-of-experts framework with spatially dynamic and temporally adaptive modality weighting that improves gene...

  2. DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity

    cs.CV 2026-05 unverdicted novelty 7.0

    Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.

  3. Triangulation of Points Constrained to a Plane

    math.AG 2026-04 unverdicted novelty 7.0

    A closed-form formula is derived for the number of complex critical points in the planar triangulation problem, valid for any number of views.

  4. Face Anything: 4D Face Reconstruction from Any Image Sequence

    cs.CV 2026-04 unverdicted novelty 7.0

    A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.

  5. Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

    cs.CV 2026-04 unverdicted novelty 7.0

    The paper presents a multimodal framework, dataset, and reconstruction pipeline to create immersive volumetric videos supporting large 6-DoF audiovisual interaction from real multi-view captures.

  6. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  7. Training a Student Expert via Semi-Supervised Foundation Model Distillation

    cs.CV 2026-04 conditional novelty 7.0

    A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.

  8. Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

    cs.CV 2026-05 unverdicted novelty 6.0

    Sat3DGen improves geometric RMSE from 6.76m to 5.20m and FID from ~40 to 19 for street-level 3D generation from satellite images via geometry-centric constraints and perspective training.

  9. DegBins: Degradation-Driven Binning for Depth Super-Resolution

    cs.CV 2026-05 unverdicted novelty 6.0

    DegBins uses degradation-driven binning and multi-stage refinement to turn residual depth regression into a more flexible hybrid classification-regression problem that outperforms prior depth super-resolution methods ...

  10. Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

    cs.CV 2026-05 unverdicted novelty 6.0

    Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.

  11. MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...

  12. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  13. LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.

  14. SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo

    cs.CV 2026-04 unverdicted novelty 6.0

    Procedural rules with NURBS generate MVS training data that outperforms same-scale manual curation and matches or exceeds larger manual datasets.

  15. DINO-VO: Learning Where to Focus for Enhanced State Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    DINO-VO achieves state-of-the-art monocular visual odometry accuracy and generalization by training a differentiable patch selector together with multi-task features and inverse-depth bundle adjustment.

  16. Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

    cs.CV 2026-03 unverdicted novelty 6.0

    Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.

  17. ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 5.0

    ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.

  18. Why Domain Matters: A Preliminary Study of Domain Effects in Underwater Object Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    A framework labels underwater images by physical characteristics to group them semantically and evaluate object detection performance across real domain factors.

  19. SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some s...

  20. Physics-Informed Neural Optimal Control for Precision Immobilization Technique in Emergency Scenarios

    eess.SY 2026-04 unverdicted novelty 5.0

    A distilled physics-informed neural surrogate in a hierarchical optimal control architecture raises simulated PIT success from 63.8% to 76.7% and succeeds in three of four low-speed scaled-vehicle tests.

  21. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  22. MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    cs.CV 2025-07 unverdicted novelty 5.0

    MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.

  23. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  24. Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation

    cs.CV 2026-04 unverdicted novelty 3.0

    Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.

  25. Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement

    cs.CV 2026-04 unverdicted novelty 3.0

    A three-stage progressive refinement model guided by DINOv2 semantics and geometric depth/normals cues won the NTIRE 2026 image shadow removal challenge with top scores of 26.68 PSNR and 0.874 SSIM.

  26. Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

    cs.CV 2026-04 unverdicted novelty 2.0

    The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.

  27. NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge reports measurable progress in 3D reconstruction pipelines that handle real-world low-light and smoke degradation via the RealX3D benchmark.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 27 Pith papers · 4 internal anchors

  1. [1]

    Mapillary planet-scale depth dataset

    Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulò, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020. 12

  2. [2]

    Do deep nets really need to be deep? In NeurIPS, 2014

    Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NeurIPS, 2014. 10

  3. [3]

    Probing the 3d awareness of visual foundation models

    Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In CVPR, 2024. 8, 14

  4. [4]

    Beit: Bert pre-training of image transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. In ICLR, 2022. 2, 5, 12, 14

  5. [5]

    Adabins: Depth estimation using adaptive bins

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In CVPR, 2021. 9, 10

  6. [6]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv:2302.12288, 2023. 2, 9, 20

  7. [7]

    1–a model zoo for robust monocular relative depth estimation

    Reiner Birkl, Diana Wofk, and Matthias Müller. Midas v3. 1–a model zoo for robust monocular relative depth estimation. arXiv:2307.14460, 2023. 2, 3, 5, 8, 9, 10, 13, 16

  8. [8]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012. 8, 9, 10, 12, 13, 14

  9. [9]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2.arXiv:2001.10773,

  10. [10]

    Learning lightweight object detectors via multi-teacher progressive distillation

    Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, and Liangyan Gui. Learning lightweight object detectors via multi-teacher progressive distillation. In ICML,

  11. [11]

    Single-image depth perception in the wild

    Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NeurIPS, 2016. 7, 8, 16

  12. [12]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2023. 12

  13. [13]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022. 12

  14. [14]

    Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes

    Jaehoon Cho, Dongbo Min, Youngjung Kim, and Kwanghoon Sohn. Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes. arXiv:2110.11590, 2021. 4, 10

  15. [15]

    Learning depth estimation for transparent and mirror surfaces

    Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Learning depth estimation for transparent and mirror surfaces. In ICCV,

  16. [16]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024. 14

  17. [17]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR,

  18. [18]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014. 10

  19. [19]

    Deep ordinal regression network for monocular depth estimation

    Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In CVPR, 2018. 10

  20. [20]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. arXiv:2403.12013, 2024. 2, 4, 7, 8, 9, 10 25

  21. [21]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv:2404.12390, 2024. 8

  22. [22]

    Unsupervised domain adaptation by backpropagation

    Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015. 10

  23. [23]

    Domain-adversarial training of neural networks

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario March, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016. 10

  24. [24]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 2013. 7, 8, 9, 10, 12, 13, 14

  25. [25]

    Depthfm: Fast monocular depth estimation with flow matching

    Ming Gui, Johannes S Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Depthfm: Fast monocular depth estimation with flow matching. arXiv:2403.13788, 2024. 1, 2, 4, 8, 10

  26. [26]

    Towards zero-shot scale-aware monocular depth estimation

    Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rare s, Ambrus, , and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In ICCV, 2023. 2, 3

  27. [27]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015. 6, 10

  28. [28]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv:2404.15506, 2024. 2, 3, 5, 8

  29. [29]

    One- former: One transformer to rule universal image segmentation

    Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. One- former: One transformer to rule universal image segmentation. In CVPR, 2023. 12

  30. [30]

    Ddp: Diffusion model for dense visual prediction

    Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. Ddp: Diffusion model for dense visual prediction. In ICCV, 2023. 12

  31. [31]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Kon- rad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, 2024. 1, 2, 4, 7, 8, 9, 10, 16, 19

  32. [32]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023. 2

  33. [33]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023. 5, 7, 9, 12, 13, 14, 22

  34. [34]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images

  35. [35]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 12, 22

  36. [36]

    Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks

    Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICMLW, 2013. 10

  37. [37]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 3, 4

  38. [38]

    Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation

    Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In CVPR, 2024. 2

  39. [39]

    Magicedit: High-fidelity and temporally coherent video editing,

    Jun Hao Liew, Hanshu Yan, Jianfeng Zhang, Zhongcong Xu, and Jiashi Feng. Magicedit: High-fidelity and temporally coherent video editing. arXiv:2308.14749, 2023. 2, 17 26

  40. [40]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV,

  41. [41]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 8

  42. [42]

    Curvefusion: reconstructing thin structures from rgbd sequences

    Lingjie Liu, Nenglun Chen, Duygu Ceylan, Christian Theobalt, Wenping Wang, and Niloy J Mitra. Curvefusion: reconstructing thin structures from rgbd sequences. TOG, 2018. 2

  43. [43]

    Structured knowledge distillation for semantic segmentation

    Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In CVPR, 2019. 10

  44. [44]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 9

  45. [45]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV,

  46. [46]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022. 12

  47. [47]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV,

  48. [48]

    Improved knowledge distillation via teacher assistant

    Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI, 2020. 6, 10

  49. [49]

    All in tokens: Unifying output space of visual tasks via soft token

    Jia Ning, Chen Li, Zheng Zhang, Chunyu Wang, Zigang Geng, Qi Dai, Kun He, and Han Hu. All in tokens: Unifying output space of visual tasks via soft token. In ICCV, 2023. 9

  50. [50]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2023. 2, 5, 13, 14

  51. [51]

    P3depth: Monocular depth estimation with a piecewise planarity prior

    Vaishakh Patil, Christos Sakaridis, Alexander Liniger, and Luc Van Gool. P3depth: Monocular depth estimation with a piecewise planarity prior. In CVPR, 2022. 9

  52. [52]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In CVPR, 2024. 2

  53. [53]

    Unrealcv: Virtual worlds for computer vision

    Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, and Yizhou Wang. Unrealcv: Virtual worlds for computer vision. In ACM MM, 2017. 4

  54. [54]

    Open challenges in deep stereo: the booster dataset

    Pierluigi Zama Ramirez, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Open challenges in deep stereo: the booster dataset. In CVPR, 2022. 4, 13

  55. [55]

    Vision transformers for dense predic- tion

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense predic- tion. In ICCV, 2021. 8, 9, 10

  56. [56]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 2022. 2, 3, 6, 10, 13, 14

  57. [57]

    Playing for data: Ground truth from computer games

    Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016. 4

  58. [58]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021. 4, 9, 12, 16 27

  59. [59]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 10

  60. [60]

    Imagenet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015. 12, 22

  61. [61]

    Learning from synthetic data: Addressing domain shift for semantic segmentation

    Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR,

  62. [62]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017. 8, 9, 10, 12, 13, 14

  63. [63]

    Airsim: High-fidelity visual and physical simulation for autonomous vehicles

    Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017. 4

  64. [64]

    Inserf: Text-driven generative object insertion in neural 3d scenes

    Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, and Federico Tombari. Inserf: Text-driven generative object insertion in neural 3d scenes. arXiv:2401.05335, 2024. 2

  65. [65]

    Objects365: A large-scale, high-quality dataset for object detection

    Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 12, 22

  66. [66]

    Nddepth: Normal-distance assisted monocular depth estimation

    Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, and Zhengguo Li. Nddepth: Normal-distance assisted monocular depth estimation. In ICCV, 2023. 9

  67. [67]

    Iebins: Iterative elastic bins for monocular depth estimation

    Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, and Zhengguo Li. Iebins: Iterative elastic bins for monocular depth estimation. In NeurIPS, 2023. 9

  68. [68]

    Realm- dreamer: Text-driven 3d scene generation with inpainting and depth diffusion,

    Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. Realmdreamer: Text- driven 3d scene generation with inpainting and depth diffusion. arXiv:2404.07199, 2024. 2

  69. [69]

    Channel-wise knowledge distillation for dense prediction

    Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. Channel-wise knowledge distillation for dense prediction. In ICCV, 2021. 10

  70. [70]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. 3, 7, 8, 9, 10, 12, 13, 14, 16

  71. [71]

    Fixmatch: Simplifying semi- supervised learning with consistency and confidence

    Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi- supervised learning with consistency and confidence. In NeurIPS, 2020. 10

  72. [72]

    The third monocular depth estimation challenge

    Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao, et al. The third monocular depth estimation challenge. arXiv:2404.16831, 2024. 2, 6

  73. [73]

    Training very deep networks

    Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In NeurIPS, 2015. 10

  74. [74]

    Segmenter: Transformer for semantic segmentation

    Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021. 12

  75. [75]

    Learning vision from models rivals learning vision from data

    Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. In CVPR, 2024. 5, 14

  76. [76]

    Dai, Andrea F

    Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z Dai, Andrea F Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R Walter, et al. Diode: A dense indoor and outdoor depth dataset. arXiv:1908.00463, 2019. 8, 9, 10, 12, 13, 14 28

  77. [77]

    Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation

    Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalistic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation. In ICME, 2021. 12

  78. [78]

    Internimage: Exploring large-scale vision foundation models with deformable convolutions

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023. 12

  79. [79]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In IROS, 2020. 12

  80. [80]

    Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving

    Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In CVPR, 2019. 2

Showing first 80 references.